
How to Use Temporal Pooling Layer? #5

Open · Yunlei-AI opened this issue Aug 2, 2023 · 3 comments

Labels: question (Further information is requested)

Yunlei-AI commented Aug 2, 2023

I use time_pooling = nn.AvgPool2d((60, 1)) as the temporal pooling layer for the Whisper large pre-trained model (encoder output size is [batch, 1500, 1280]), but the 'last_mlp' and 'last_tr' methods cannot reach the accuracy reported in the paper in my tests on the ESC-50 dataset. So I would like to ask whether my settings are correct.
The detailed code is as follows:

def forward(self, x):
    outputs = self.model.encoder(x)       # large: [bs, 1500, 1280]
    outputs = self.time_pooling(outputs)  # large: [bs, 25, 1280]
    outputs = self.time_tr(outputs)       # temporal transformer
    outputs = torch.mean(outputs, dim=1)  # mean over time; large: [bs, 1280]
    logits = self.mlp_layer(outputs)      # large: [bs, 50]
    return logits

Another question: what does "Linear Projection" mean? Is it nn.Linear? I didn't find this part in the 'models.py' file in your released code.

YuanGongND added the question label Aug 2, 2023

YuanGongND (Owner) commented

I assume you meant:

# (baseline) input audio_rep is in shape [B, num_layer, time, feat_dim]
if self.mode == 'mean_mlp':
    audio_rep = torch.mean(audio_rep, dim=1)  # mean over layers: [B, 25, 1280]
    audio_rep = torch.mean(audio_rep, dim=1)  # mean over time: [B, 1280]
    audio_rep = self.mlp_layer(audio_rep)
    return audio_rep
# (baseline)
elif self.mode == 'last_mlp':
    audio_rep = audio_rep[:, -1, :, :]        # get the last layer: [B, 25, 1280]
    audio_rep = torch.mean(audio_rep, dim=1)  # mean over time: [B, 1280]
    audio_rep = self.mlp_layer(audio_rep)
    return audio_rep
# (baseline)
elif self.mode == 'wa_mlp':
    audio_rep = torch.mean(audio_rep, dim=2)         # mean over time: [B, 32, 1280]
    audio_rep = torch.permute(audio_rep, (0, 2, 1))  # [B, 1280, 32]
    audio_rep = (audio_rep @ self.layer_weight) / self.layer_weight.sum()  # [B, 1280]
    audio_rep = self.mlp_layer(audio_rep)
    return audio_rep
# (baseline)
elif 'mean_tr' in self.mode:
    audio_rep = torch.mean(audio_rep, dim=1)  # mean over layers: [B, 25, 1280]
    audio_rep = self.time_tr(audio_rep)       # temporal transformer: [B, 25, 1280]
    audio_rep = torch.mean(audio_rep, dim=1)  # mean over time: [B, 1280]
    audio_rep = self.mlp_layer(audio_rep)
    return audio_rep
# (baseline) time transformer on the last-layer representation
elif 'last_tr' in self.mode:
    audio_rep = audio_rep[:, -1, :, :]        # get the last layer: [B, 25, 1280]
    audio_rep = self.time_tr(audio_rep)       # temporal transformer: [B, 25, 1280]
    audio_rep = torch.mean(audio_rep, dim=1)  # mean over time: [B, 1280]
    audio_rep = self.mlp_layer(audio_rep)
    return audio_rep
# (baseline) time transformer on the layer-wise weight-averaged representation
elif 'wa_tr' in self.mode:
    audio_rep = torch.permute(audio_rep, (0, 2, 3, 1))  # [B, 25, 1280, 32]
    audio_rep = (audio_rep @ self.layer_weight) / self.layer_weight.sum()  # [B, 25, 1280]
    audio_rep = self.time_tr(audio_rep)       # temporal transformer: [B, 25, 1280]
    audio_rep = torch.mean(audio_rep, dim=1)  # mean over time: [B, 1280]
    audio_rep = self.mlp_layer(audio_rep)
    return audio_rep
# (baseline) weight average with low-dimension projection
elif 'wa_down_tr' in self.mode:
    audio_rep = torch.permute(audio_rep, (0, 2, 3, 1))  # [B, 25, 1280, 32]
    audio_rep = (audio_rep @ self.layer_weight) / self.layer_weight.sum()  # [B, 25, 1280]
    audio_rep = self.down_layer(audio_rep)    # linear projection: [B, 25, down_dim]
    audio_rep = self.time_tr(audio_rep)       # temporal transformer: [B, 25, down_dim]
    audio_rep = torch.mean(audio_rep, dim=1)  # mean over time: [B, down_dim]
    audio_rep = self.mlp_layer(audio_rep)
    return audio_rep
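
On the "Linear Projection" question: the low-dimension projection in the 'wa_down_tr' branch (self.down_layer) is the likely candidate, and a plain linear layer is the natural choice. For context, here is a minimal sketch of how the modules referenced above could be constructed; the hyperparameter values below are illustrative assumptions, not the repo's actual settings:

import torch
import torch.nn as nn

# illustrative sketch; these values are assumptions, not the repo's settings
embed_dim, num_layers, n_class, low_dim = 1280, 32, 50, 512

# learnable per-layer weights for the 'wa_*' modes
layer_weight = nn.Parameter(torch.ones(num_layers) / num_layers)

# temporal transformer over the 25 pooled time frames ('*_tr' modes)
time_tr = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8, batch_first=True),
    num_layers=1)

# low-dimension linear projection used by 'wa_down_tr'
down_layer = nn.Linear(embed_dim, low_dim)

# classification head; for 'wa_down_tr' its input would be low_dim instead
mlp_layer = nn.Sequential(nn.LayerNorm(embed_dim), nn.Linear(embed_dim, n_class))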

What is your accuracy on ESC-50? We tested on Whisper large-v1, not large (which is actually large-v2). We also noticed that Whisper features have slightly different values when computed on different GPUs. So if the gap is just 1-2 points, it is reasonable to me.

Which temporal pooling layer do you mean?

-Yuan

Yunlei-AI (Author) commented Aug 2, 2023

Thank you for your quick reply.

I use the "last_mlp" method in 'whisper-at/src/whisper_at_train/models.py' and test accuracy is 82% on the original Whisper large, not large-v1 or large-v2. I will try large-v1. I appreciate your reminder.

For the temporal pooling in Figure 4 of your paper: the last Whisper encoder layer's output feature shape is [bs, 500, 1280], and after temporal pooling it becomes [bs, 25, 1280]. Did you use torch.nn.AvgPool2d? I did not use your released ESC-50 features, so I don't know whether I am using AvgPool2d correctly.

-YunFei

Yunlei-AI closed this as not planned Aug 2, 2023
Yunlei-AI reopened this Aug 2, 2023
YuanGongND (Owner) commented

Hi there,

Sorry, I have a hard deadline this week, so I need to follow up on this later.

We released the code for the actual feature extraction:

elif o_layer == 'all_nopool':
    all_x = []
    for block in self.blocks:
        all_x.append(x)              # x in shape [B, audio_len, feat_dim]
        x = block(x)
    all_x.append(x)                  # also keep the last block's output
    all_x = torch.stack(all_x, dim=3)  # in shape [B, audio_len, feat_dim, num_layer]
    return all_x
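
Note the layout difference between the two snippets: the extraction code stacks layers last ([B, audio_len, feat_dim, num_layer]), while the training code above indexes layers at dim 1 ([B, num_layer, audio_len, feat_dim]). So something like the following permute is needed in between (a sketch of the idea, not the exact line from the repo):

# reorder the stacked features to match what the training code expects
audio_rep = all_x.permute(0, 3, 1, 2)  # [B, num_layer, audio_len, feat_dim]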

Note: please change the hard-coded 1000 to 500 for ESC-50 (Whisper's mel spectrogram has 100 frames per second, so 10s AudioSet clips give 1000 frames and 5s ESC-50 clips give 500):

# TODO: 1000 is for AudioSet, 500 is for ESC-50
mel = pad_or_trim(mel, 1000).to(model.device).to(dtype)

The actual pooling layer is here (note it is 10x pooling for ESC-50 vs. 20x for AudioSet; since ESC-50 clips are 5s and AudioSet clips are 10s, the pooled output is 25 time frames for both):

https://github.com/YuanGongND/whisper-at/blob/main/src/noise_robust_asr/intermediate_feat_extract/esc-50/extract_esc50_whisper_all_pool.py
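
As a quick shape check of those numbers (a minimal sketch; whether the repo uses nn.AvgPool2d exactly like this is in the linked script): Whisper's encoder downsamples the mel frames by 2x, so 500 mel frames for ESC-50 become 250 encoder frames, and 1000 become 500 for AudioSet.

import torch
import torch.nn as nn

# minimal shape check, assuming average pooling over the time axis only
esc_feat = torch.randn(1, 250, 1280)  # 5s ESC-50 clip: 250 encoder frames
as_feat = torch.randn(1, 500, 1280)   # 10s AudioSet clip: 500 encoder frames

esc_pool = nn.AvgPool2d(kernel_size=(10, 1))  # 10x pooling in time
as_pool = nn.AvgPool2d(kernel_size=(20, 1))   # 20x pooling in time

print(esc_pool(esc_feat).shape)  # torch.Size([1, 25, 1280])
print(as_pool(as_feat).shape)    # torch.Size([1, 25, 1280])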

When you call large, you actually call large-v2; you need to specify v1. But 82% still feels a bit low to me: we reported 87% in the paper, right? It is a standard 5-fold cross-validation.

-Yuan
