Model gets stuck in some words #172

Closed
CarlitosDev opened this issue Nov 23, 2022 · 8 comments · Fixed by #291
Labels
enhancement New feature or request

Comments

@CarlitosDev

Latest whisper.cpp version on a Mac M1
Model ggml-medium.en.bin
Additional parameters: -t 8 -ml 1
Mono audio file

It seems that the model gets stuck in some words and misses the actual conversation.

Screenshot 2022-11-23 at 11 54 09
Screenshot 2022-11-23 at 11 57 14

@CarlitosDev CarlitosDev changed the title Model get _stuck_ in some words Model gets _stuck_ in some words Nov 23, 2022
@CarlitosDev CarlitosDev changed the title Model gets _stuck_ in some words Model gets stuck in some words Nov 23, 2022
@ggerganov
Owner

I believe this is a known limitation of the model - see this discussion for more info:

openai/whisper#29

There are various strategies that can be added to reduce the occurrence of this behaviour (i.e. beam search decoding, temperature fallbacks, VAD, etc.). Some of these are already available in the original implementation from OpenAI, so you can try running it and see if this resolves your issue.
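For illustration, the temperature fallback strategy mentioned above can be sketched roughly as follows. This is a minimal sketch, not code from whisper.cpp or from OpenAI: `decode_fn` is a hypothetical hook standing in for whatever decodes one audio window at a given temperature, and the default thresholds and temperature schedule are assumptions modelled on what the OpenAI reference implementation is understood to use. The key idea is that text stuck in a loop compresses unusually well, so a high compression ratio is a cheap signal that the decode should be retried with more randomness.

```python
import zlib
from typing import Callable, NamedTuple, Optional, Sequence


class DecodeResult(NamedTuple):
    text: str
    avg_logprob: float  # mean log-probability of the decoded tokens


def compression_ratio(text: str) -> float:
    """Raw size over zlib-compressed size; repetitive text scores high."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))


def decode_with_fallback(
    decode_fn: Callable[[float], DecodeResult],  # hypothetical decoder hook
    temperatures: Sequence[float] = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),  # assumed schedule
    compression_ratio_threshold: float = 2.4,  # assumed default
    logprob_threshold: float = -1.0,  # assumed default
) -> DecodeResult:
    """Retry at increasing temperatures until the result looks acceptable."""
    result: Optional[DecodeResult] = None
    for temperature in temperatures:
        result = decode_fn(temperature)
        too_repetitive = compression_ratio(result.text) > compression_ratio_threshold
        too_unlikely = result.avg_logprob < logprob_threshold
        if not (too_repetitive or too_unlikely):
            break  # good enough, stop escalating
    if result is None:
        raise ValueError("temperatures must not be empty")
    return result
```

If every temperature fails the checks, the sketch simply returns the last attempt; a real implementation might prefer to skip the window or flag it instead.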

@szeidner

szeidner commented Dec 9, 2022

I've run into this issue as well, but I see a difference between the output of Whisper (Python) and whisper.cpp. While the Python version repeats some words, whisper.cpp has pretty long sections (up to 8 minutes or so) where a single phrase is repeated. I wonder if there is anything that can be done to improve the behavior. Do you think this difference is due to the original implementation using beam search decoding or something similar? If so, I wonder how difficult it would be to implement that in C++?

I've attached the output from both versions of whisper for comparison. I ran it on this podcast episode with the tiny model used for both runs.

whisper.python.txt
whisper.cpp.txt
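For readers unfamiliar with the term, beam search decoding as raised in the question above can be sketched on a toy model. This is only an illustration under stated assumptions: `next_logprobs` is a hypothetical hook that returns next-token log-probabilities for a prefix, the tokens are plain strings, and no length normalisation or timestamp handling is attempted, so it is not a description of how either Whisper implementation decodes.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical model hook: maps a token prefix to next-token log-probabilities.
NextLogProbs = Callable[[Tuple[str, ...]], Dict[str, float]]
Beam = Tuple[Tuple[str, ...], float]


def beam_search(next_logprobs: NextLogProbs, beam_size: int, max_len: int,
                eos: str = "<eos>") -> Beam:
    """Return the best sequence found and its total log-probability."""
    beams: List[Beam] = [((), 0.0)]
    for _ in range(max_len):
        candidates: List[Beam] = []
        for prefix, score in beams:
            if prefix and prefix[-1] == eos:
                candidates.append((prefix, score))  # finished, carry forward
                continue
            for token, logprob in next_logprobs(prefix).items():
                candidates.append((prefix + (token,), score + logprob))
        # Keep only the beam_size highest-scoring hypotheses.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
        if all(p and p[-1] == eos for p, _ in beams):
            break
    return beams[0]
```

With `beam_size=1` this reduces to greedy decoding, which commits to the locally best token at each step; a wider beam keeps alternatives alive, which is one reason it may be less prone to locking onto a repeated phrase.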

@ggerganov
Owner

@szeidner
Yes, it's likely due to the inferior decoding strategy in whisper.cpp.
I've made some improvements lately - you might give it another try, but probably your case is still going to fail.
I think we need the temperature feature from the OpenAI decoding method to fix this.
Implementation is not very difficult, but I keep prioritising other stuff.
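As a rough illustration of what "temperature" means when picking the next token, here is a minimal sketch. The function name and the plain list of logits are assumptions made for the example; it is not the whisper.cpp or OpenAI code. At temperature 0 it behaves greedily, and at higher temperatures it samples from a progressively flatter distribution, which gives the decoder a chance to escape a repeating loop.

```python
import math
import random
from typing import Optional, Sequence


def sample_token(logits: Sequence[float], temperature: float,
                 rng: Optional[random.Random] = None) -> int:
    """Pick a token index: argmax at temperature 0, otherwise sample."""
    if temperature <= 0.0:
        # Greedy choice: always the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])
    rng = rng or random.Random()
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    # random.choices normalises the weights, giving softmax(logits / T).
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]
```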

@geimist

geimist commented Dec 16, 2022

I also keep having this problem, which is why I keep having to discard tasks, unfortunately. A workaround would be great. 👍

@ggerganov ggerganov reopened this Dec 16, 2022
@ggerganov ggerganov added the enhancement New feature or request label Dec 16, 2022
@szeidner

@ggerganov Thanks for looking into this! I do seem to run into this issue on most podcasts I've tried, so an implementation of temperature as a potential fix would be awesome. Thank you!

@janngobble

I'm def having this issue as well. I'm having it with -l it (I'm transcribing Italian then using an external engine to translate to EN - colloquialisms are so hard to deal with in some translators and this is a detective TV series "Murders at Barlume"), but it still gets stuck for ~1-15 minutes on one random phrase. (audio format PCM/WAV, 1 channel, 16 bits, ~1 hr 30 min long)

Having SAID that, the output of cpp is so much faster than whisper, it's worth it to try it on a show to see if it works and if it doesn't, restart or run in whisper - cos where it DOES work, it is so much faster on my M1 MBP 13" that it's worth the time.

Thanks for the work, @ggerganov! I'll keep following (and updating my repo) to see if things get better. If you need a sample, please let me know.

@janngobble

> I think we need the temperature feature from the OpenAI decoding method to fix this.
> Implementation is not very difficult, but I keep prioritising other stuff.

You can't say stuff like this and just expect someone is not gonna give the obvious reply - which as I am a programmer myself - I absolutely WILL NOT say... 😂

I respect all the work you do too much to do that!

@RndyP

RndyP commented Dec 30, 2022

I'm seeing the same issue. For instance, I send 10 seconds of audio that has simply the number "six" repeated six times, and Whisper gets to work on it and takes a half minute to come back with 100 sixes. During the time it's cranking on it, the CPU is really loaded, which is not good.

Issue #29 talks about silence gaps causing this behaviour, but saying "six" six times in 10 seconds is not a whole lot of silence. Maybe after the 3rd "six" it's the devil's number and this is hanging it up :)

Also, the NULL pointer problem in issue #344 occurs often when it gets stuck in this loop.
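Until the decoder itself is improved, output like the "100 sixes" case above can at least be flagged after the fact. The following is a minimal sketch of such a check, assuming the transcript is available as a list of segment strings; the function name and the `min_repeats` threshold are made up for the example. It only catches runs of identical consecutive segments, so a loop that spans several segments would need an n-gram variant.

```python
from itertools import groupby
from typing import List, Tuple


def find_stuck_runs(segments: List[str],
                    min_repeats: int = 8) -> List[Tuple[int, int, str]]:
    """Return (start_index, run_length, text) for long runs of identical segments."""
    runs = []
    index = 0
    normalised = (s.strip().lower() for s in segments)
    for text, group in groupby(normalised):
        length = sum(1 for _ in group)
        if text and length >= min_repeats:
            runs.append((index, length, text))
        index += length
    return runs
```

With the assumed default of eight repeats, a speaker who really does say "six" six times is left alone, while a run of a hundred is reported together with where it starts.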

@ggerganov ggerganov linked a pull request Jan 8, 2023 that will close this issue

6 participants