Detecting Syllable / Phonetic Timestamps #762
To give you the most relevant advice, please explain your primary use case for detecting syllables or phonemes with WhisperX. For example, are you focusing on language learning, speech therapy, or another area? Understanding your main objective will help us avoid the XY problem and offer more targeted assistance. Thank you.
@SmartManoj Thank you for your reply. We are experimenting with pronunciation improvement, detecting and analyzing how people pronounce syllables / phonemes. These sounds may not constitute a complete / actual word, and Whisper seems to output only recognized words. For our use case, it might not be necessary to identify the spoken word, but rather the syllables and corresponding timestamps. We have also been scouting the web for alternative Python modules that can split speech audio into syllables, but have not come across a suitable option. Can you please advise? Thanks.
You might find the My-Voice Analysis library useful. It's a Python library developed for voice analysis that can detect syllable boundaries in audio files without needing a transcription. See the My-Voice Analysis GitHub repository for more information and usage instructions.
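My-Voice Analysis builds on Praat's syllable-nuclei approach, whose core idea is that syllables roughly correspond to peaks in the smoothed intensity envelope of the speech signal. A minimal sketch of that idea in plain NumPy (the function name, frame sizes, threshold, and synthetic test signal are all illustrative assumptions, not the library's actual API or parameters):

```python
import numpy as np

def syllable_nuclei(signal, sr, frame_ms=25, hop_ms=10, db_threshold=-25.0):
    """Rough syllable-nucleus detector: local maxima of the short-time
    energy envelope that rise above a dB threshold relative to the
    loudest frame. Returns candidate nucleus times in seconds."""
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    # short-time energy, one value per hop
    energy = np.array([
        np.sum(signal[i:i + frame] ** 2)
        for i in range(0, len(signal) - frame, hop)
    ])
    # energy in dB relative to the loudest frame
    db = 10 * np.log10(energy / (energy.max() + 1e-12) + 1e-12)
    peaks = []
    for i in range(1, len(db) - 1):
        # local maximum above the loudness threshold
        if db[i] > db_threshold and db[i] >= db[i - 1] and db[i] > db[i + 1]:
            peaks.append(i * hop / sr)
    return peaks

# Two synthetic "syllables": Hann-shaped bursts of a 200 Hz tone.
sr = 16000
t = np.arange(sr) / sr
sig = np.sin(2 * np.pi * 200 * t)
env = np.zeros_like(sig)
env[1000:4000] = np.hanning(3000)   # burst ~0.06-0.25 s
env[9000:12000] = np.hanning(3000)  # burst ~0.56-0.75 s
sig *= env
print(syllable_nuclei(sig, sr))  # peak times (s), inside the two bursts
```

Real speech needs band-limiting to the voiced range and a voicing check on top of this, which is exactly what the Praat script adds; this only shows the peak-picking core.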
@SmartManoj Thank you, that reference is really helpful. We will experiment with that module to generate the syllables and timestamps. Just to reconfirm: can Whisper / WhisperX also be tweaked to detect syllables, or does it only work with proper words?
I'd also be interested in phoneme-based timestamps.
We would also be interested in this, if it can be run locally and deliver phoneme predictions similar to the Azure API.
Also interested in getting phoneme-level timestamps (for neuroscience-of-speech research).
@kingjr have you been able to find something? |
I've tested 3 projects:
I think the third is the best so far, but it's not perfect.
May I know if there is a way to use WhisperX to generate timestamps for syllables or phonemes, instead of the words detected by the Whisper model?
Our use case is to detect pronunciations / syllables in audio recordings, and sometimes words are not properly detected / omitted by Whisper (even for large models).
It would be helpful if we could obtain the syllable / phonetic timestamps, even if it is not a recognized word.
Thank you.
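For context on what such a change would involve: WhisperX gets its word timestamps from forced alignment with a wav2vec2 CTC model, and the same post-processing would yield phoneme timestamps given a phoneme-level CTC model. The collapse step — merge repeated frame predictions, drop blanks, convert frame spans to seconds — can be sketched on fake data (the phoneme inventory, 20 ms frame stride, and one-hot "logits" below are assumptions for illustration, not WhisperX internals):

```python
import numpy as np

# Hypothetical phoneme inventory; index 0 is the CTC blank token.
PHONEMES = ["<blank>", "HH", "AH", "L", "OW"]
FRAME_STRIDE = 0.02  # assumed seconds per CTC frame (wav2vec2 is ~20 ms)

def ctc_phoneme_segments(logits, stride=FRAME_STRIDE):
    """Collapse per-frame CTC argmax predictions into phoneme segments:
    merge consecutive repeats, drop blanks, and map frame indices to
    (label, start_s, end_s) tuples via the frame stride."""
    ids = logits.argmax(axis=-1)
    segments = []
    start = None
    for i, tok in enumerate(ids):
        prev = ids[i - 1] if i > 0 else -1
        if tok != prev:
            # close the previous run (unless it was blank)
            if start is not None and prev != 0:
                segments.append((PHONEMES[prev], start * stride, i * stride))
            start = i
    if start is not None and ids[-1] != 0:
        segments.append((PHONEMES[ids[-1]], start * stride, len(ids) * stride))
    return segments

# Fake one-hot logits for 10 frames spelling "HH AH L OW" with blanks.
frames = [1, 1, 0, 2, 2, 2, 0, 3, 4, 4]  # argmax index per frame
logits = np.eye(len(PHONEMES))[frames]
for ph, t0, t1 in ctc_phoneme_segments(logits):
    print(f"{ph}: {t0:.2f}-{t1:.2f} s")
# HH: 0.00-0.04 s / AH: 0.06-0.12 s / L: 0.14-0.16 s / OW: 0.16-0.20 s
```

In practice you would feed real logits from a phoneme-level CTC model rather than modify Whisper itself, since Whisper's decoder emits word/BPE tokens, not phonemes.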