Replies: 7 comments 18 replies
-
I'm also interested in the speaker diarization feature.
-
A common approach to diarization is to first create embeddings (think vocal-feature fingerprints) for each speech segment (a chunk of speech you obtain from the timestamps; something like 10.00s -> 13.52s would be a segment), and then cluster the embeddings so you know which speech segments can be grouped together and assigned to a speaker. I have only glanced through the paper and will look into it more in depth, but the key question is whether Whisper can create such embeddings. My gut feeling is that this is unlikely, as the model focuses more on what is said than on how something is said.
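For illustration, here is a minimal sketch of that embed-then-cluster idea, using SpeechBrain's ECAPA speaker encoder in place of Whisper (the encoder, the distance threshold, and the file name are assumptions for the example, not a recommendation):

```python
# Sketch: embed each Whisper segment with a speaker encoder, then cluster.
import numpy as np
import torchaudio
import whisper
from sklearn.cluster import AgglomerativeClustering
from speechbrain.pretrained import EncoderClassifier

model = whisper.load_model("base")
segments = model.transcribe("audio.wav")["segments"]  # timestamped chunks

encoder = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")
waveform, sr = torchaudio.load("audio.wav")  # this encoder expects 16 kHz mono

# One "vocal fingerprint" per segment, e.g. 10.00s -> 13.52s.
embeddings = []
for seg in segments:
    chunk = waveform[:, int(seg["start"] * sr):int(seg["end"] * sr)]
    embeddings.append(encoder.encode_batch(chunk).squeeze().detach().numpy())

# Group the fingerprints; segments in the same cluster get the same speaker.
# distance_threshold is a guess to tune per recording.
labels = AgglomerativeClustering(
    n_clusters=None, distance_threshold=1.0, metric="cosine", linkage="average"
).fit_predict(np.vstack(embeddings))

for seg, label in zip(segments, labels):
    print(f"SPEAKER_{label}: {seg['start']:.2f}s -> {seg['end']:.2f}s {seg['text']}")
```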
-
Example script using pyannote (an unrelated software project) for diarisation, grouping consecutive segments by the same speaker into a 'rolled up' dataframe rather than keeping every segment pyannote finds: https://gist.github.com/lmmx/0970a01295e12531f6a3f0ac5537e0b8

Speaker verification example here, using SpeechBrain: https://github.com/pyannote/pyannote-audio/blob/develop/tutorials/speaker_verification.ipynb

Keen to develop a solution that matches up to Whisper here, and I expect @ShantanuNair's answer is better (and would look more like the speaker verification script).
I agree, I don't think it'd work with Whisper's output alone, as I've seen it group multiple speakers into a single caption. (Unfortunately, putting whisper and pyannote in a single environment leads to a bit of a clash between overlapping dependency versions, namely HuggingFace Hub.)
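For reference, a minimal sketch of the 'rolled up' step described above, merging consecutive rows that share a speaker (the column names are assumptions, not necessarily the gist's exact schema):

```python
# Merge consecutive same-speaker segments into single rows.
import pandas as pd

df = pd.DataFrame({
    "start":   [0.0, 2.1, 4.0, 7.3],
    "end":     [2.0, 3.9, 7.0, 9.5],
    "speaker": ["A", "A", "B", "A"],
})

# A new group starts whenever the speaker changes between rows.
turn = (df["speaker"] != df["speaker"].shift()).cumsum()
rolled = df.groupby(turn).agg(
    start=("start", "min"), end=("end", "max"), speaker=("speaker", "first")
).reset_index(drop=True)
print(rolled)  # 3 rows: A [0.0-3.9], B [4.0-7.0], A [7.3-9.5]
```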
-
After digging a bit more into this subject, it feels that timed speaker diarization is only achievable using a hybrid approach. Using pyannote (see Majdoddin's work) seems to be a good and fast solution, but adding silences to the audio files that are later fed to Whisper might generate unwanted hallucinations and influence the context of the transcription, especially for non-English transcripts. jongwook suggested in another discussion to "do a crude form of speaker turn tracking". Possible approach:
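One possible shape for such a hybrid pipeline, sketched below under stated assumptions (not Majdoddin's exact method): let pyannote find the speaker turns, then transcribe each turn with Whisper separately, slicing the audio instead of padding it with silences. Model names and the pydub usage are illustrative, and newer pyannote pipelines require a HuggingFace auth token.

```python
# Sketch of a hybrid pipeline: pyannote for speaker turns, Whisper per turn.
import whisper
from pydub import AudioSegment
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization")
diarization = pipeline("audio.wav")  # who speaks when

audio = AudioSegment.from_wav("audio.wav")
model = whisper.load_model("base")

for turn, _, speaker in diarization.itertracks(yield_label=True):
    # Slice the turn out of the file (pydub indexes in milliseconds)...
    audio[int(turn.start * 1000):int(turn.end * 1000)].export("turn.wav", format="wav")
    # ...and transcribe just that speaker's speech, so captions never mix speakers.
    text = model.transcribe("turn.wav")["text"].strip()
    print(f"[{turn.start:7.2f} -> {turn.end:7.2f}] {speaker}: {text}")
```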
-
Any chance that this diarization could be implemented as a standalone package, rather than having to rely on Google Colab?
-
Hey, does that mean you are just gonna use Pyannote to do the transcription instead of Whisper? So there would be no OpenAI involved?
-
www.lexicaps.com seamlessly adds diarization to Whisper's transcription. No third-party packages.
-
Is there a simple way to do speaker diarization with Whisper?