
How to Use Temporal Pooling Layer? #5

Open · Yunlei-AI opened this issue Aug 2, 2023 · 3 comments

Labels: question (Further information is requested)

Yunlei-AI commented Aug 2, 2023

I use time_pooling = nn.AvgPool2d((60, 1)) as the temporal pooling layer for the Whisper large pre-trained model (encoder output size is [batch, 1500, 1280]), but the 'last_mlp' and 'last_tr' methods cannot reach the accuracy reported in the paper in my tests on the ESC-50 dataset. So I would like to ask whether my settings are correct.
The detailed code is as follows:

def forward(self, x):
    outputs = self.model.encoder(x)       # large: [bs, 1500, 1280]
    outputs = self.time_pooling(outputs)  # large: [bs, 25, 1280]
    outputs = self.time_tr(outputs)       # temporal transformer
    outputs = torch.mean(outputs, dim=1)  # mean over time; large: [bs, 1280]
    logits = self.mlp_layer(outputs)      # large: [bs, 50]
    return logits

Another question: what does "Linear Projection" mean? Is it nn.Linear? I didn't find this part in the 'models.py' file in your released code.

YuanGongND added the question label Aug 2, 2023

YuanGongND (Owner) commented

I assume you meant:

# (baseline) input audio_rep is in shape [B, num_layer, time, feat_dim]
if self.mode == 'mean_mlp':
    audio_rep = torch.mean(audio_rep, dim=1)  # mean over layers: [B, 25, 1280]
    audio_rep = torch.mean(audio_rep, dim=1)  # mean over time: [B, 1280]
    audio_rep = self.mlp_layer(audio_rep)
    return audio_rep
# (baseline)
elif self.mode == 'last_mlp':
    audio_rep = audio_rep[:, -1, :, :]        # get the last layer: [B, 25, 1280]
    audio_rep = torch.mean(audio_rep, dim=1)  # mean over time: [B, 1280]
    audio_rep = self.mlp_layer(audio_rep)
    return audio_rep
# (baseline)
elif self.mode == 'wa_mlp':
    audio_rep = torch.mean(audio_rep, dim=2)         # mean over time: [B, 32, 1280]
    audio_rep = torch.permute(audio_rep, (0, 2, 1))  # [B, 1280, 32]
    audio_rep = (audio_rep @ self.layer_weight) / self.layer_weight.sum()  # [B, 1280]
    audio_rep = self.mlp_layer(audio_rep)
    return audio_rep
# (baseline)
elif 'mean_tr' in self.mode:
    audio_rep = torch.mean(audio_rep, dim=1)  # mean over layers: [B, 25, 1280]
    audio_rep = self.time_tr(audio_rep)       # temporal transformer: [B, 25, 1280]
    audio_rep = torch.mean(audio_rep, dim=1)  # mean over time: [B, 1280]
    audio_rep = self.mlp_layer(audio_rep)
    return audio_rep
# (baseline) time transformer on the last-layer representation
elif 'last_tr' in self.mode:
    audio_rep = audio_rep[:, -1, :, :]        # get the last layer: [B, 25, 1280]
    audio_rep = self.time_tr(audio_rep)       # temporal transformer: [B, 25, 1280]
    audio_rep = torch.mean(audio_rep, dim=1)  # mean over time: [B, 1280]
    audio_rep = self.mlp_layer(audio_rep)
    return audio_rep
# (baseline) time transformer on the layer-wise weight-averaged representation
elif 'wa_tr' in self.mode:
    audio_rep = torch.permute(audio_rep, (0, 2, 3, 1))  # [B, 25, 1280, 32]
    audio_rep = (audio_rep @ self.layer_weight) / self.layer_weight.sum()  # [B, 25, 1280]
    audio_rep = self.time_tr(audio_rep)       # temporal transformer: [B, 25, 1280]
    audio_rep = torch.mean(audio_rep, dim=1)  # mean over time: [B, 1280]
    audio_rep = self.mlp_layer(audio_rep)
    return audio_rep
# (baseline) weight average with low-dimension projection
elif 'wa_down_tr' in self.mode:
    audio_rep = torch.permute(audio_rep, (0, 2, 3, 1))  # [B, 25, 1280, 32]
    audio_rep = (audio_rep @ self.layer_weight) / self.layer_weight.sum()  # [B, 25, 1280]
    audio_rep = self.down_layer(audio_rep)    # linear projection: [B, 25, down_dim]
    audio_rep = self.time_tr(audio_rep)       # temporal transformer: [B, 25, down_dim]
    audio_rep = torch.mean(audio_rep, dim=1)  # mean over time: [B, down_dim]
    audio_rep = self.mlp_layer(audio_rep)
    return audio_rep
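
On the "Linear Projection" question: the low-dimension projection in the 'wa_down_tr' branch (self.down_layer) is the likely candidate, and a plain linear layer is the natural choice. For context, here is a minimal sketch of how the modules referenced above could be constructed; the hyperparameter values below are illustrative assumptions, not the repo's actual settings:

import torch
import torch.nn as nn

# illustrative sketch; these values are assumptions, not the repo's settings
embed_dim, num_layers, n_class, low_dim = 1280, 32, 50, 512

# learnable per-layer weights for the 'wa_*' modes
layer_weight = nn.Parameter(torch.ones(num_layers) / num_layers)

# temporal transformer over the 25 pooled time frames ('*_tr' modes)
time_tr = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8, batch_first=True),
    num_layers=1)

# low-dimension linear projection used by 'wa_down_tr'
down_layer = nn.Linear(embed_dim, low_dim)

# classification head; for 'wa_down_tr' its input would be low_dim instead
mlp_layer = nn.Sequential(nn.LayerNorm(embed_dim), nn.Linear(embed_dim, n_class))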

What is your accuracy on ESC-50? We tested on Whisper large-v1, not large (which is actually large-v2). We also noticed that Whisper features have slightly different values when computed on different GPUs. So if the gap is just 1-2 points, it is reasonable to me.

Which temporal pooling layer do you mean?

-Yuan

Yunlei-AI (Author) commented Aug 2, 2023

Thank you for your quick reply.

I use the "last_mlp" method in 'whisper-at/src/whisper_at_train/models.py' and test accuracy is 82% on the original Whisper large, not large-v1 or large-v2. I will try large-v1. I appreciate your reminder.

For the temporal pooling in Figure 4 of your paper: the last Whisper encoder layer's output feature shape is [bs, 500, 1280], and after temporal pooling it becomes [bs, 25, 1280]. Did you use torch.nn.AvgPool2d? I did not use your released ESC-50 features, so I don't know whether I am using AvgPool2d correctly.

-YunFei

Yunlei-AI closed this as not planned Aug 2, 2023
Yunlei-AI reopened this Aug 2, 2023
YuanGongND (Owner) commented

Hi there,

Sorry, I have a hard deadline this week, so I need to follow up on this later.

We released the code for the actual feature extraction:

elif o_layer == 'all_nopool':
    all_x = []
    for block in self.blocks:
        all_x.append(x)              # x in shape [B, audio_len, feat_dim]
        x = block(x)
    all_x.append(x)                  # also keep the last block's output
    all_x = torch.stack(all_x, dim=3)  # in shape [B, audio_len, feat_dim, num_layer]
    return all_x
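
Note the layout difference between the two snippets: the extraction code stacks layers last ([B, audio_len, feat_dim, num_layer]), while the training code above indexes layers at dim 1 ([B, num_layer, audio_len, feat_dim]). So something like the following permute is needed in between (a sketch of the idea, not the exact line from the repo):

# reorder the stacked features to match what the training code expects
audio_rep = all_x.permute(0, 3, 1, 2)  # [B, num_layer, audio_len, feat_dim]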

Note: please change the hard-coded 1000 to 500 for ESC-50 (Whisper's mel spectrogram has 100 frames per second, so 10s AudioSet clips give 1000 frames and 5s ESC-50 clips give 500):

# TODO: 1000 is for AudioSet, 500 is for ESC-50
mel = pad_or_trim(mel, 1000).to(model.device).to(dtype)

The actual pooling layer is here (note it is 10x pooling for ESC-50 vs. 20x for AudioSet; since ESC-50 clips are 5s and AudioSet clips are 10s, the pooled output is 25 time frames for both):

https://github.com/YuanGongND/whisper-at/blob/main/src/noise_robust_asr/intermediate_feat_extract/esc-50/extract_esc50_whisper_all_pool.py
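
As a quick shape check of those numbers (a minimal sketch; whether the repo uses nn.AvgPool2d exactly like this is in the linked script): Whisper's encoder downsamples the mel frames by 2x, so 500 mel frames for ESC-50 become 250 encoder frames, and 1000 become 500 for AudioSet.

import torch
import torch.nn as nn

# minimal shape check, assuming average pooling over the time axis only
esc_feat = torch.randn(1, 250, 1280)  # 5s ESC-50 clip: 250 encoder frames
as_feat = torch.randn(1, 500, 1280)   # 10s AudioSet clip: 500 encoder frames

esc_pool = nn.AvgPool2d(kernel_size=(10, 1))  # 10x pooling in time
as_pool = nn.AvgPool2d(kernel_size=(20, 1))   # 20x pooling in time

print(esc_pool(esc_feat).shape)  # torch.Size([1, 25, 1280])
print(as_pool(as_feat).shape)    # torch.Size([1, 25, 1280])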

When you call large, you actually call large-v2; you need to specify v1. But 82% still feels a bit low to me: we reported 87% in the paper, right? It is a standard 5-fold cross-validation.

-Yuan
