Replies: 7 comments 18 replies
-
I'm also interested in the speaker diarization feature.
-
A common approach to diarization is to first create embeddings (think vocal-feature fingerprints) for each speech segment (a chunk of speech you obtain from the timestamps; something like 10.00s -> 13.52s would be a segment), and then cluster the embeddings so you know which speech segments can be grouped together and assigned to a speaker. I have only glanced through the paper and will look into it more in depth, but the key question is whether Whisper can create such embeddings. My gut feeling is that this is unlikely, as the model focuses more on what is said than on how something is said.
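For illustration, here is a minimal sketch of that embed-then-cluster idea, using SpeechBrain's ECAPA speaker encoder in place of Whisper (the encoder, the distance threshold, and the file name are assumptions for the example, not a recommendation):

```python
# Sketch: embed each Whisper segment with a speaker encoder, then cluster.
import numpy as np
import torchaudio
import whisper
from sklearn.cluster import AgglomerativeClustering
from speechbrain.pretrained import EncoderClassifier

model = whisper.load_model("base")
segments = model.transcribe("audio.wav")["segments"]  # timestamped chunks

encoder = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")
waveform, sr = torchaudio.load("audio.wav")  # this encoder expects 16 kHz mono

# One "vocal fingerprint" per segment, e.g. 10.00s -> 13.52s.
embeddings = []
for seg in segments:
    chunk = waveform[:, int(seg["start"] * sr):int(seg["end"] * sr)]
    embeddings.append(encoder.encode_batch(chunk).squeeze().detach().numpy())

# Group the fingerprints; segments in the same cluster get the same speaker.
# distance_threshold is a guess to tune per recording.
labels = AgglomerativeClustering(
    n_clusters=None, distance_threshold=1.0, metric="cosine", linkage="average"
).fit_predict(np.vstack(embeddings))

for seg, label in zip(segments, labels):
    print(f"SPEAKER_{label}: {seg['start']:.2f}s -> {seg['end']:.2f}s {seg['text']}")
```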
-
Example script using pyannote (an unrelated software project) for diarisation, grouping consecutive segments by the same speaker into a 'rolled up' dataframe rather than keeping every segment pyannote finds: https://gist.github.com/lmmx/0970a01295e12531f6a3f0ac5537e0b8

Speaker verification example here, using SpeechBrain: https://github.com/pyannote/pyannote-audio/blob/develop/tutorials/speaker_verification.ipynb

Keen to develop a solution that matches up to Whisper here, and I expect @ShantanuNair's answer is better (and would look more like the speaker verification script).
I agree, I don't think it'd work with Whisper's output alone, as I've seen it group multiple speakers into a single caption. (Unfortunately, putting whisper and pyannote in a single environment leads to a bit of a clash between overlapping dependency versions, namely HuggingFace Hub.)
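For reference, a minimal sketch of the 'rolled up' step described above, merging consecutive rows that share a speaker (the column names are assumptions, not necessarily the gist's exact schema):

```python
# Merge consecutive same-speaker segments into single rows.
import pandas as pd

df = pd.DataFrame({
    "start":   [0.0, 2.1, 4.0, 7.3],
    "end":     [2.0, 3.9, 7.0, 9.5],
    "speaker": ["A", "A", "B", "A"],
})

# A new group starts whenever the speaker changes between rows.
turn = (df["speaker"] != df["speaker"].shift()).cumsum()
rolled = df.groupby(turn).agg(
    start=("start", "min"), end=("end", "max"), speaker=("speaker", "first")
).reset_index(drop=True)
print(rolled)  # 3 rows: A [0.0-3.9], B [4.0-7.0], A [7.3-9.5]
```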
-
After digging a bit more into this subject, it feels that timed speaker diarization is only achievable using a hybrid approach. Using pyannote (see Majdoddin's work) seems to be a good and fast solution, but adding silences to the audio files that are later fed to Whisper might generate unwanted hallucinations and influence the context of the transcription, especially for non-English transcripts. jongwook suggested in another discussion to "do a crude form of speaker turn tracking". Possible approach:
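One possible shape for such a hybrid pipeline, sketched below under stated assumptions (not Majdoddin's exact method): let pyannote find the speaker turns, then transcribe each turn with Whisper separately, slicing the audio instead of padding it with silences. Model names and the pydub usage are illustrative, and newer pyannote pipelines require a HuggingFace auth token.

```python
# Sketch of a hybrid pipeline: pyannote for speaker turns, Whisper per turn.
import whisper
from pydub import AudioSegment
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization")
diarization = pipeline("audio.wav")  # who speaks when

audio = AudioSegment.from_wav("audio.wav")
model = whisper.load_model("base")

for turn, _, speaker in diarization.itertracks(yield_label=True):
    # Slice the turn out of the file (pydub indexes in milliseconds)...
    audio[int(turn.start * 1000):int(turn.end * 1000)].export("turn.wav", format="wav")
    # ...and transcribe just that speaker's speech, so captions never mix speakers.
    text = model.transcribe("turn.wav")["text"].strip()
    print(f"[{turn.start:7.2f} -> {turn.end:7.2f}] {speaker}: {text}")
```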
-
Any chance that this diarization could be implemented as a standalone package, rather than having to rely on Google Colab?
-
Hey, does that mean you are just gonna use Pyannote to do the transcription instead of Whisper? So there would be no OpenAI involved?
-
www.lexicaps.com seamlessly adds diarization to Whisper's transcription. No third-party packages.
-
Is there a simple way to do speaker diarization with Whisper?