🚀 Introducing Whisper-AT, a new joint audio tagging and speech recognition model. #1504

YuanGongND · 2023-07-07T03:12:02Z

[Paper]
[HuggingFace Space] (Try Whisper-AT without Coding!)
[Colab Demo]
[Source Code]

We are glad to introduce Whisper-AT, a new joint audio tagging and speech recognition model. In addition to the transcribed text, it outputs labels for background sounds.

Key features:

Whisper-AT inherits all of Whisper's APIs as well as its ASR performance. With minimal code changes, you get the same output as the original Whisper.
In addition, Whisper-AT outputs audio event tags for 527 classes (the AudioSet ontology) at your desired time resolution. Its audio tagging performance is close to that of SOTA standalone audio tagging models.
No additional dependencies and less than 1% extra computational cost compared to the original Whisper: if your device can run the original Whisper, it can also run Whisper-AT.
You can easily customize Whisper-AT (e.g., set the threshold and the classes of interest). Multiple languages are supported; by default, audio tag names follow the ASR language.
If you are interested in the research side, our paper shows that while Whisper is very robust against real-world background sounds (e.g., music), its audio representation is actually NOT noise-invariant. Instead, it is highly correlated with non-speech sounds, indicating that Whisper recognizes speech conditioned on the noise type.

See the demo (Please turn on the audio to listen to the sounds):

cooking.mp4

Code and pretrained models are released [here]. Please give it a try and let us know what you think!

dgoryeo · 2023-07-08T10:29:01Z

Thanks @YuanGongND. Is it possible to get the audio tags when using the tiny model? Do you expect the audio tagging results to differ between the large and tiny models, say for Japanese?

YuanGongND · 2023-07-08T19:08:01Z

Author

Hi @dgoryeo,

Thanks so much for your interest! Yes, Whisper-AT definitely supports the tiny models. In fact, it supports all Whisper models.

Regarding the performance (see column AS mAP, the higher, the better):

| Model Name | #ASR Params | Language | #AT Params (TL-TR) | AS mAP (TL-TR) | #AT Params (TL-TR-512) | AS mAP (TL-TR-512) |
|---|---|---|---|---|---|---|
| large-v2 (large) | 1550M | Multilingual | 40.0M | 41.7 | 7.2M | 40.3 |
| large-v1 | 1550M | Multilingual | 40.0M | 42.1 | 7.2M | 41.6 |
| medium.en | 769M | English | 25.8M | 41.4 | 7.1M | 41.1 |
| medium | 769M | Multilingual | 25.8M | 40.8 | 7.1M | 41.2 |
| small.en | 244M | English | 14.6M | 40.1 | 6.9M | 39.9 |
| small | 244M | Multilingual | 14.6M | 39.8 | 6.9M | 39.8 |
| base.en | 74M | English | 6.6M | 37.5 | - | - |
| base | 74M | Multilingual | 6.6M | 37.6 | - | - |
| tiny.en | 39M | English | 3.8M | 35.8 | - | - |
| tiny | 39M | Multilingual | 3.8M | 36.5 | - | - |

Abbreviations:

#ASR Params = Model parameters for the automatic speech recognition task.

#AT Params = Model parameters for audio tagging.

TL-TR = The proposed time- and layer-wise Transformer model. Its dimension follows the Whisper model, e.g., 1280 for whisper-large.

TL-TR-512 = The proposed low-computation time- and layer-wise Transformer model. Its dimension is projected to 512. It is not available for the tiny and base models (marked "-" in the table), whose dimensions do not exceed 512.

AS mAP = The audio tagging mean average precision (mAP) on the AudioSet evaluation set.
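To make the AS mAP column concrete, here is a minimal sketch of how mean average precision is commonly computed for multi-label tagging. It uses the standard non-interpolated definition (average the precision at the rank of each positive clip, then average over classes). The function names and the toy data are hypothetical, and the paper's actual evaluation script may differ in details such as tie handling.

```python
from typing import List, Sequence


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AP for one class: mean of the precision at the rank of each positive clip."""
    # Rank clips from the highest score to the lowest.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions: List[float] = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(label_matrix, score_matrix) -> float:
    """mAP over classes; label_matrix[i][c] and score_matrix[i][c] refer to clip i, class c."""
    num_classes = len(label_matrix[0])
    per_class = []
    for c in range(num_classes):
        class_labels = [row[c] for row in label_matrix]
        class_scores = [row[c] for row in score_matrix]
        # Skip classes with no positive clip, since AP is undefined for them.
        if any(class_labels):
            per_class.append(average_precision(class_labels, class_scores))
    return sum(per_class) / len(per_class)


# Toy data (made up): 3 clips, 2 classes.
toy_labels = [[1, 0], [0, 1], [1, 0]]
toy_scores = [[0.9, 0.2], [0.8, 0.7], [0.3, 0.1]]
print(mean_average_precision(toy_labels, toy_scores))
```

The table reports this kind of score as a percentage over the 527 AudioSet classes.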

As the table shows, smaller models have weaker audio tagging performance (e.g., 36.5 mAP for tiny vs. 41.7 for large-v2 with TL-TR), but the performance is still reasonably good!

In practice, you can try it quickly with the following code:

pip install whisper-at

and then

import whisper_at as whisper

audio_tagging_time_resolution = 10
model = whisper.load_model("tiny")
result = model.transcribe("audio.mp3", at_time_res=audio_tagging_time_resolution)
# ASR Results
print(result["text"])
# Audio Tagging Results
audio_tag_result = whisper.parse_at_label(result, language='follow_asr', top_k=5, p_threshold=-1, include_class_list=list(range(527)))
print(audio_tag_result)

P.S. Whisper-AT has a nice feature: by default, the language of the audio tag names follows the ASR language. For example, if your ASR output is Japanese, the audio tags are also in Japanese.
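As a rough illustration of working with the tagging output, the sketch below reduces each time window to its highest-scoring tag. It assumes that `parse_at_label` returns a list of dicts shaped like `{'time': {'start': ..., 'end': ...}, 'audio tags': [(label, score), ...]}`. That shape is an assumption here, so please check it against what `print(audio_tag_result)` shows on your machine. The helper `top_tag_per_window`, its `min_score` parameter, and the sample values are hypothetical.

```python
def top_tag_per_window(audio_tag_result, min_score=None):
    """Return (start, end, label) per window; label is None if no tag passes min_score."""
    timeline = []
    for window in audio_tag_result:
        start = window["time"]["start"]
        end = window["time"]["end"]
        tags = window["audio tags"]
        if min_score is not None:
            # Optional extra filtering on top of whatever p_threshold already did.
            tags = [(label, score) for label, score in tags if score >= min_score]
        if not tags:
            timeline.append((start, end, None))
            continue
        best_label, _ = max(tags, key=lambda tag: tag[1])
        timeline.append((start, end, best_label))
    return timeline


# Hand-written sample in the assumed format; the scores are illustrative only.
sample_result = [
    {"time": {"start": 0, "end": 10}, "audio tags": [("Music", 1.8), ("Speech", 0.9)]},
    {"time": {"start": 10, "end": 20}, "audio tags": [("Speech", 0.4)]},
]
for start, end, label in top_tag_per_window(sample_result):
    print(f"{start:>4}s - {end:>4}s: {label}")
```

With real output, you would pass `audio_tag_result` from the snippet above instead of `sample_result`.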

For more details, please check the [Colab Demo] and our [GitHub Repo].

Cheers,
Yuan

2 replies

dgoryeo Jul 10, 2023

Hi @YuanGongND, thank you so much for the detailed response.

The use case I have in mind is this: I have noticed that for transcription of long-form audio (longer than 30 min), the results are much better if the audio is broken into "scenes/chapters" and each "scene/chapter" is then transcribed with appropriate hyperparameters. The hyperparameters can be chosen better if each scene is categorised as noise, music, overlapping talk, silence, etc. I was thinking of using a two-pass approach with Whisper-AT: one pass with the tiny model to identify the audio tags, and a second pass with a larger model to transcribe each scene based on its tags.

I was wondering if that would make sense as an approach.

YuanGongND Sep 14, 2023
Author

Sorry for the late reply.

I think a two-pass approach might be a valid solution, going from coarse-grained to fine-grained classification. But we haven't run any experiments on this.

-Yuan
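The two-pass idea discussed above could be sketched roughly as follows. This is an untested illustration, not something the authors have evaluated. It takes per-window labels from a first pass as `(start, end, label)` tuples, merges consecutive windows of the same coarse category into scenes, and looks up transcription options for the second pass. The names `CATEGORY_OF`, `PRESETS`, `merge_into_scenes`, and `plan_second_pass` are hypothetical. The option names (`condition_on_previous_text`, `no_speech_threshold`, `beam_size`) come from the original Whisper `transcribe` API, but the values are untuned guesses, and the English tag names assume English tag output.

```python
from itertools import groupby

# Hypothetical mapping from tag names to coarse scene categories.
# Any tag not listed here falls back to "noise".
CATEGORY_OF = {"Speech": "speech", "Music": "music", "Silence": "silence"}

# Illustrative, untuned second-pass options per category; None means "skip this scene".
PRESETS = {
    "speech": {"condition_on_previous_text": True},
    "music": {"condition_on_previous_text": False, "no_speech_threshold": 0.4},
    "noise": {"condition_on_previous_text": False, "beam_size": 5},
    "silence": None,
}


def merge_into_scenes(timeline):
    """Merge consecutive (start, end, label) windows that share a coarse category."""
    categorized = [
        (start, end, CATEGORY_OF.get(label, "noise")) for start, end, label in timeline
    ]
    scenes = []
    for category, group in groupby(categorized, key=lambda window: window[2]):
        windows = list(group)
        scenes.append(
            {"start": windows[0][0], "end": windows[-1][1], "category": category}
        )
    return scenes


def plan_second_pass(scenes):
    """Pair each scene worth transcribing with its option preset."""
    plan = []
    for scene in scenes:
        options = PRESETS[scene["category"]]
        if options is not None:
            plan.append((scene, options))
    return plan


# First-pass output for a made-up 50-second recording.
first_pass = [
    (0, 10, "Speech"), (10, 20, "Speech"), (20, 30, "Music"),
    (30, 40, "Silence"), (40, 50, "Dog"),
]
for scene, options in plan_second_pass(merge_into_scenes(first_pass)):
    print(scene, options)
```

In a real pipeline, each planned scene would be cut from the audio and passed to a larger model's `transcribe` call with the chosen options; whether this actually improves long-form transcription would need to be verified experimentally.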
