Detecting Syllable / Phonetic Timestamps #762

Akz47 · 2024-03-31T09:23:15Z

May I know if there is a way to use WhisperX to generate timestamps for syllables or phonemes, instead of for the words detected by the Whisper model?

Our use case is to detect pronunciations / syllables in audio recordings, and sometimes words are detected incorrectly or omitted by Whisper (even with the large models).

It would be helpful if we could obtain the syllable / phonetic timestamps, even if it is not a recognized word.

Thank you.

SmartManoj · 2024-04-01T10:32:29Z

To give you the most relevant advice, could you explain your primary use case for detecting syllables or phonemes with WhisperX? For example, are you focusing on language learning, speech therapy, or another area? Understanding your main objective will help us avoid the XY problem and offer more targeted assistance.

Thank you.

Akz47 · 2024-04-01T10:55:54Z

@SmartManoj Thank you for your reply.

We are experimenting with pronunciation improvement, detecting and analyzing how people pronounce syllables / phonemes. These sounds may not constitute a complete / actual word, and Whisper seems to output only recognized words.

For our use case, it might not be necessary to identify the spoken word, but rather, the syllables and corresponding timestamps.

We have also been scouting the web for alternative Python modules that can split speech audio into syllables, but have not come across a suitable option.

Can you please advise? Thanks.

SmartManoj · 2024-04-01T11:07:57Z

You might find the My-Voice Analysis library useful. It is a Python library developed for voice analysis that can detect syllable boundaries in audio files without needing a transcription. See its GitHub repository for more information and usage instructions: My-Voice Analysis.
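
To illustrate what transcription-free syllable detection can look like, here is a minimal sketch that picks peaks in a smoothed short-time energy envelope and treats each peak as a candidate syllable nucleus. This is not the My-Voice Analysis algorithm and not part of WhisperX. The function names (`short_time_energy`, `smooth`, `syllable_nuclei`, `synthetic_bursts`) and all parameter defaults are hypothetical choices for illustration, and real speech would likely need band-pass filtering and tuned thresholds.

```python
import math


def short_time_energy(samples, frame_len, hop):
    """Mean squared amplitude of each frame (frames advance by `hop` samples)."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    return energies


def smooth(values, width):
    """Centered moving average; the window shrinks at the edges."""
    half = width // 2
    out = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


def syllable_nuclei(samples, sample_rate, frame_ms=25, hop_ms=10,
                    rel_threshold=0.3, min_gap_s=0.1):
    """Return times (seconds) of energy-envelope peaks, as candidate nuclei.

    Assumed heuristic: a peak counts only if it reaches `rel_threshold`
    times the loudest frame and lies at least `min_gap_s` after the
    previously accepted peak.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    env = smooth(short_time_energy(samples, frame_len, hop), 5)
    if not env:
        return []
    floor = rel_threshold * max(env)
    times = []
    for i in range(1, len(env) - 1):
        is_peak = env[i] > env[i - 1] and env[i] >= env[i + 1]
        if is_peak and env[i] >= floor:
            t = (i * hop + frame_len / 2) / sample_rate  # frame center
            if not times or t - times[-1] >= min_gap_s:
                times.append(t)
    return times


def synthetic_bursts(centers, sample_rate, duration_s, width_s=0.2, freq=200.0):
    """Test signal: one raised-cosine tone burst per center, silence elsewhere."""
    half = width_s / 2
    samples = []
    for n in range(int(duration_s * sample_rate)):
        t = n / sample_rate
        amp = 0.0
        for c in centers:
            d = abs(t - c)
            if d < half:
                amp = 0.5 * (1 + math.cos(math.pi * d / half))
        samples.append(amp * math.sin(2 * math.pi * freq * t))
    return samples


if __name__ == "__main__":
    sr = 8000
    signal = synthetic_bursts([0.2, 0.6, 1.0], sr, 1.2)
    print(syllable_nuclei(signal, sr))
```

The synthetic input stands in for a mono recording that has already been decoded to a list of floats. The sketch only returns candidate nucleus times; deriving syllable boundaries (for example, from the envelope minima between adjacent peaks) is left out.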

Akz47 · 2024-04-01T12:13:05Z

@SmartManoj Thank you, that reference is really helpful. We will experiment with that module to generate the syllables and timestamps.

Just to reconfirm, can Whisper / WhisperX also be tweaked to detect syllables, or does it only work with proper words?

hollarob · 2024-04-09T15:13:03Z

I would also be interested in phoneme-based timestamps.

jonaaathan · 2024-05-27T08:32:36Z

We would also be interested in this, if it can be run locally and deliver phoneme predictions similar to the Azure API.

kingjr · 2024-07-24T17:14:24Z

Also interested in getting phoneme-level timestamps (for neuroscience of speech research).

TasseDeCafe · 2024-11-22T20:39:56Z

@kingjr have you been able to find something?

diyism · 2024-12-20T05:09:52Z

I've tested 3 projects:

1. thetaOscillator-syllable-segmentation: "[Need help] How to realize Syllable-level Voice Recognition with sherpa-onnx Open Vocabulary Keyword Spotting", k2-fsa/sherpa-onnx#920 (comment)
2. allosaurus: "[Need help] How to realize Syllable-level Voice Recognition with sherpa-onnx Open Vocabulary Keyword Spotting", k2-fsa/sherpa-onnx#920 (comment)
3. pyannote_segment_syllables: "Add pyannote vad (segmentation) model", k2-fsa/sherpa-onnx#1197 (comment)

I think the third is the best so far, but it is not perfect.
