Detecting Syllable / Phonetic Timestamps #762

Open
Akz47 opened this issue Mar 31, 2024 · 9 comments

Comments

@Akz47

Akz47 commented Mar 31, 2024

Is there a way to use WhisperX to generate timestamps for syllables or phonemes, instead of for the words detected by the Whisper model?

Our use case is to detect pronunciations / syllables in audio recordings, and words are sometimes misdetected or omitted by Whisper (even with the large models).

It would be helpful if we could obtain the syllable / phonetic timestamps, even if it is not a recognized word.

Thank you.

@SmartManoj

To give you the most relevant advice, please explain your primary use case for detecting syllables or phonemes with WhisperX. For example, are you focusing on language learning, speech therapy, or another area? Understanding your main objective will help us avoid the XY problem and offer more targeted assistance.

Thank you.

@Akz47
Author

Akz47 commented Apr 1, 2024

@SmartManoj Thank you for your reply.

We are experimenting with pronunciation improvement, detecting and analyzing how people pronounce syllables / phonemes. These sounds may not constitute a complete / actual word, and Whisper seems to output only recognized words.

For our use case, it might not be necessary to identify the spoken word, but rather, the syllables and corresponding timestamps.

We have also been scouting the web for alternative Python modules that can split speech audio into syllables, but have not come across a suitable option.

Can you please advise? Thanks.

@SmartManoj

You might find the My-Voice Analysis library useful. It is a Python library for voice analysis that can detect syllable boundaries in audio files without needing a transcription. See its GitHub repository for more information and usage instructions: My-Voice Analysis.
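For readers who want a feel for what transcription-free syllable detection involves, here is a minimal sketch of one common heuristic: picking peaks in a smoothed short-time energy envelope as approximate syllable nuclei. This is not the My-Voice Analysis API and not anything WhisperX provides; the function name `syllable_nuclei` and all parameter defaults are assumptions chosen for illustration, and real speech would likely need tuning or a more robust method.

```python
import numpy as np


def syllable_nuclei(signal, sr, frame_ms=20.0, hop_ms=10.0,
                    rel_threshold=0.3, min_gap_s=0.1):
    """Return approximate syllable-nucleus times (seconds).

    Hypothetical helper: picks local maxima of a smoothed RMS energy
    envelope. All defaults are illustrative assumptions.
    """
    signal = np.asarray(signal, dtype=float)
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame) // hop)

    # Short-time RMS energy, one value per frame.
    rms = np.array([
        np.sqrt(np.mean(signal[i * hop:i * hop + frame] ** 2))
        for i in range(n_frames)
    ])

    # Smooth with a 5-frame moving average to suppress small ripples.
    env = np.convolve(rms, np.ones(5) / 5, mode="same")
    threshold = rel_threshold * env.max()

    times = []
    for i in range(1, len(env) - 1):
        is_peak = env[i] > env[i - 1] and env[i] >= env[i + 1]
        if is_peak and env[i] >= threshold:
            t = (i * hop + frame / 2) / sr  # centre of the frame
            # Drop peaks that follow the previous accepted one too closely.
            if not times or t - times[-1] >= min_gap_s:
                times.append(t)
    return times


if __name__ == "__main__":
    # Synthetic stand-in for speech: three tone bursts separated by silence.
    sr = 16000
    sig = np.zeros(19200)
    for start in (1600, 8000, 14400):
        n = np.arange(start, start + 3200)
        sig[n] = np.hanning(3200) * np.sin(2 * np.pi * 200 * n / sr)
    print(syllable_nuclei(sig, sr))
```

The sketch only returns one timestamp per energy peak; it does not label which syllable or phoneme was spoken, so it addresses the "where" but not the "what" of the original question.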

@Akz47
Author

Akz47 commented Apr 1, 2024

@SmartManoj Thank you, that reference is really helpful. We will experiment with that module to generate the syllables and timestamps.

Just to confirm: can Whisper / WhisperX also be tweaked to detect syllables, or do they only work with proper words?

@hollarob

hollarob commented Apr 9, 2024

I would also be interested in phoneme-based timestamps.

@jonaaathan

We would also be interested in this, if it can be run locally and deliver phoneme predictions similar to the Azure API.

@kingjr

kingjr commented Jul 24, 2024

Also interested in getting phoneme-level timestamps (for neuroscience of speech research).

@TasseDeCafe

@kingjr Have you been able to find something?
