Support for realtime audio input #10
The Whisper model processes the audio in chunks of 30 seconds - this is a hard constraint of the architecture. However, what seems to work is you can take for example 5 seconds of audio and pad it with 25 seconds of silence. This way you can process shorter chunks. Given that, an obvious strategy for realtime audio transcription is the following:
The problem with that is you need to do the same amount of computation for 1 second of audio as you would do for 2, 3, ... , 30 seconds of audio. So if your audio input step is 1 second (as shown in the example above), you will effectively do 30 times the computation that you would normally do to process the full 30 seconds. I plan to add a basic example of real-time audio transcription using the above strategy. |
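To make the padding idea and the cost estimate above concrete, here is a minimal sketch in plain Python. It is not code from whisper.cpp: the helper names `pad_to_window` and `relative_cost` are made up for illustration, and it assumes 16 kHz mono samples stored as floats.

```python
SAMPLE_RATE = 16_000   # assumed: 16 kHz mono input
WINDOW_SECONDS = 30    # fixed window size described above


def pad_to_window(samples, sample_rate=SAMPLE_RATE, window_seconds=WINDOW_SECONDS):
    """Return a copy of `samples` padded with zeros (silence) to one full window."""
    window_len = sample_rate * window_seconds
    if len(samples) > window_len:
        raise ValueError("chunk is longer than one window")
    return list(samples) + [0.0] * (window_len - len(samples))


def relative_cost(step_seconds, window_seconds=WINDOW_SECONDS):
    """Number of full-window inferences run per window of captured audio."""
    return window_seconds / step_seconds


chunk = [0.1] * (5 * SAMPLE_RATE)     # 5 seconds of fake audio
padded = pad_to_window(chunk)
print(len(padded) / SAMPLE_RATE)      # 30.0 seconds after padding
print(relative_cost(1))               # 30.0 inferences per 30 s with a 1 s step
```

Every call processes a full 30-second window regardless of how much real audio it contains, which is why a 1-second step costs roughly 30 times as much as processing the same audio in one pass.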
I was a bit afraid that that would be the answer, but I'll definitely check out that basic example when it's ready! |
- Processes input in chunks of 3 seconds
- Pads audio with silence
- Uses 1 second of audio from the previous pass
- No text context
Just added a very naive implementation of the idea above. To run it, simply do:
# install sdl2 (Ubuntu)
$ sudo apt-get install libsdl2-dev
# install sdl2 (Mac OS)
$ brew install sdl2
# download a model if you don't have one
$ ./download-ggml-model.sh base.en
# run the real-time audio transcription
$ make stream
$ ./stream -m models/ggml-base.en.bin
This example continuously captures audio from the mic and runs whisper on the captured audio. The results are not great because the current implementation can chop the audio in the middle of words. However, all these things can be significantly improved. |
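The chunking described in this comment (3-second steps, with 1 second of audio carried over from the previous pass) could be sketched roughly as follows. This is an illustration only, not the actual stream example: `sliding_chunks` and its parameters are hypothetical names, and the audio is assumed to be a plain list of samples.

```python
def sliding_chunks(samples, sample_rate=16_000, step_seconds=3.0, keep_seconds=1.0):
    """Yield one buffer per pass: up to `keep_seconds` of audio from the
    previous pass, followed by up to `step_seconds` of new audio."""
    step = int(step_seconds * sample_rate)
    keep = int(keep_seconds * sample_rate)
    for start in range(0, len(samples), step):
        overlap_start = max(0, start - keep)
        yield samples[overlap_start:start + step]


# Tiny demo with a fake sample rate of 10 Hz so the numbers stay readable.
demo = list(range(100))
for buf in sliding_chunks(demo, sample_rate=10):
    print(buf[0], len(buf))
```

Carrying over a little audio from the previous pass gives the model some chance to recover a word that was cut at a chunk boundary, although it does not remove the problem.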
Here is a short video demonstration of near real-time transcription from the microphone: rt_esl_csgo_1.mp4 |
Nice, but somehow I can't kill it (stream); I had to do a killall -9 stream. On an AMD 2-thread 3 GHz processor with 16 GB RAM, there is significant delay. However, I found that I get 2x realtime with the usual transcription of an audio file. Great work. I love this. |
Thanks for the feedback. Regarding the performance - I hope it can be improved with a better strategy to decide when to perform the inference. Currently, it is done every X seconds, regardless of the data. If we add voice detection, we should be able to run it less often. But overall, it seems that real-time transcription will always be slower compared to the original 30-seconds chunk transcription. |
Thanks for the quick fix. I have some suggestions/ideas for faster voice transcription. Give me half an hour to one hour, I'll update here with new content.
Edit / Updated: Here are some ideas to speed up offline, non-real-time transcription:
Removing silence helps a lot in reducing the total length of the audio (not yet tried, but obvious): http://andrewslotnick.com/posts/speeding-up-a-speech.html#Remove-Silence
Things that I tried with good results: First I ran a half-hour audio file through https://github.com/xiph/rnnoise. Then I increased the tempo to 1.5 with sox (tempo preserves pitch). After that I got good results with tiny.en, but base.en seemed to be less accurate. The overall process is much faster: really fast transcription except for the initial delay.
Here are some ideas for faster real time transcription: I noticed that when I ran this on a 5 sec clip, I got this result:
Now if we could apply this: 1. VAD / silence detection (like you mentioned) to split the audio into chunks. The result is variable-length audio chunks in memory or in temp files. Example VAD: https://github.com/cirosilvano/easyvad, or maybe use WebRTC VAD? I guess experimentation is needed to figure out the best strategy/approach to real time, considering the 30-seconds-at-once issue. |
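As a rough illustration of splitting audio into variable-length chunks on silence, here is a minimal energy-based sketch. It is a stand-in for a real VAD such as the libraries mentioned above, not a port of any of them; the function names, frame size, and threshold are assumptions chosen for the example.

```python
def frame_energies(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def split_on_silence(samples, frame_len=160, threshold=1e-4, min_silence_frames=3):
    """Return (start, end) sample ranges whose frame energy is above `threshold`.

    A segment is closed after `min_silence_frames` consecutive quiet frames.
    """
    segments = []
    start = None   # index of the first loud frame of the open segment
    quiet = 0      # consecutive quiet frames seen inside the open segment
    energies = frame_energies(samples, frame_len)
    for idx, energy in enumerate(energies):
        if energy >= threshold:
            if start is None:
                start = idx
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet >= min_silence_frames:
                end = idx - quiet + 1          # first quiet frame
                segments.append((start * frame_len, end * frame_len))
                start, quiet = None, 0
    if start is not None:                      # audio ended mid-segment
        end = len(energies) - quiet
        segments.append((start * frame_len, end * frame_len))
    return segments
```

Each returned range could then be padded and transcribed on its own, so inference only runs when there is something to transcribe. A fixed energy threshold is only likely to behave well in quiet environments.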
Controls how often we run the inference. By default, we run it every 3 seconds.
Seems the results become worse when we keep the context, so by default this is not enabled
Some improvement on the real-time transcription: rt_esl_csgo_2.mp4 |
I'll check this out and give you feedback here tomorrow. Awesome work! Brilliant. |
hello @ggerganov , thanks for sharing!
|
On resource-constrained machines it doesn't seem to be better. The previous version worked for transcribing; this one is choking the CPU with no or only intermittent output. The same kill issue persists. I think it's because processes are spawned, which makes the system laggy. @ggerganov I also caught a Floating point exception (core dumped) playing around with options: -t 2 --step 5000 --length 5000
|
But I think this not working on resource constrained devices should not be a blocker for you. If it works for everyone else, please feel free to close. |
I think that floating point exception might be related to #39 as well, which was running on a 4 core AMD64 Linux server, not too resource constrained. |
The |
Also mentioning this here since it would be a super cool feature to have: Any way to register a callback or call a script once user speech is completed and silence/non-speak is detected? Been trying to hack on the CPP code but my CPP skills are rusty :( |
@pachacamac Will think about adding this option. Silence/non-speech detection is not trivial in general, but maybe some simple thresholding approach that works in a quiet environment should not be too difficult to implement. |
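A simple thresholding approach with a callback, along the lines discussed here, might look like the sketch below. It is purely illustrative and not part of whisper.cpp: the class name `SpeechEndDetector`, the threshold, and the frame counts are all assumptions.

```python
class SpeechEndDetector:
    """Calls `on_speech_end` once speech is followed by enough quiet frames."""

    def __init__(self, on_speech_end, threshold=1e-4, silence_frames=25):
        self.on_speech_end = on_speech_end
        self.threshold = threshold            # mean squared amplitude cutoff
        self.silence_frames = silence_frames  # quiet frames needed to trigger
        self._in_speech = False
        self._quiet = 0

    def feed(self, frame):
        """Process one frame of samples (a list of floats)."""
        energy = sum(x * x for x in frame) / max(len(frame), 1)
        if energy >= self.threshold:
            self._in_speech = True
            self._quiet = 0
        elif self._in_speech:
            self._quiet += 1
            if self._quiet >= self.silence_frames:
                self._in_speech = False
                self._quiet = 0
                self.on_speech_end()


events = []
detector = SpeechEndDetector(lambda: events.append("speech ended"), silence_frames=2)
for frame in ([0.0] * 4, [0.5] * 4, [0.0] * 4, [0.0] * 4, [0.0] * 4):
    detector.feed(frame)
print(events)
```

The callback could launch a script or hand the captured audio to the transcriber. The detector fires only once per utterance because it returns to the idle state after triggering.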
Hi, I am trying to run real-time transcription on the Raspberry Pi 4B, with a ReSpeaker Mic array. Is there any way to specify the audio input device when running |
I was able to specify the default input device through
Curious if you have any luck getting real-time transcription to work on a Pi 4. Mine seems to run just a little too slow to give useful results, even with the |
Hi @alexose and @RyanSelesnik: Have you had any success using Respeaker 4 Mic Array (UAC1.0) to run the stream script on Raspberry?
ubuntu@ubuntu:~/usrlib/whisper.cpp$ ./stream -m ./models/ggml-tiny.en.bin -t 8 --step 500 --length 5000 -c 0
main: processing 8000 samples (step = 0.5 sec / len = 5.0 sec), 8 threads, lang = en, task = transcribe, timestamps = 0 ...
[BLANK_AUDIO]
main: WARNING: cannot process audio fast enough, dropping audio ...
But using whisper on prerecorded audio from the same ReSpeaker device worked well:
sudo arecord -f S16_LE -d 10 -r 16000 --device="hw:1,0" /tmp/test-mic.wav
Any suggestions for testing ./stream properly? |
@andres-ramirez-duque I haven't had any luck in getting the streaming functionality to run fast enough. I'm not good enough with C++ to know where to start optimizing, but I suspect the comment in PR #23 sheds some light on the issue:
As well as the notes on optimization from @trholding and @ggerganov above. |
So the rule-of-thumb for using the
Note down the If the step is smaller, then it is very likely that the processing will be slower compared to the audio capture and it won't work. Overall, streaming on Raspberries is a long shot. Maybe when they start supporting the FP16 arithmetic (i.e. |
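The condition described here, that one inference pass has to finish before the next step of audio arrives, can be written as a small check. This is only a sketch of that reasoning: `can_keep_up` is a hypothetical helper, and the processing times in the example are made-up numbers that would have to be measured on the actual device.

```python
def can_keep_up(processing_seconds, step_ms):
    """True if one inference pass finishes before the next step of audio arrives."""
    return processing_seconds <= step_ms / 1000.0


# Hypothetical measurements, for illustration only.
print(can_keep_up(processing_seconds=0.4, step_ms=500))   # True
print(can_keep_up(processing_seconds=2.5, step_ms=500))   # False
```

When the check fails, audio accumulates faster than it is processed, which matches the "cannot process audio fast enough, dropping audio" warning reported earlier in the thread.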
Yes, makes sense. For those of us trying to make this work on a cheap single-board computer, we'll probably want to use something like a Banana Pi BPI M5 (which is form-factor compatible with the Pi 4 but ships with a Cortex A55). |
Were you able to implement any of these ideas? Are there significant performance improvements? |
Hi,
And it works, but the result I get is far behind the demo video. It just gets stuck with the first sentence and tries to update it instead of adding new sentences. |
silero-vad looks best for VAD but I don't know how to port this to Swift yet - onnx and python notebook edit: found a swift port, https://github.com/tangfuhao/Silero-VAD-for-iOS |
After making my model using Core ML, I got this error while trying to build stream:
|
This is awesome! Would love to have a high-level description of the optimisations that were made. |
Thank you for your great work! I've added some simple logic to detect silence, and process only real voice input: #1649. |
Try compiling stream with WHISPER_COREML=1 |
Why does adding the -l zh parameter only display Traditional Chinese, and why does adding the --prompt parameter report an invalid parameter? How can I display Simplified Chinese? Also, the accuracy of Chinese recognition is not very high; this needs to be solved. Thank you very much! |
add --prompt "简体输出" |
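When using the stream (real-time mode) script, adding --prompt reports that the parameter is invalid >︿< If needed, I can post the original error when I have time so you can take a look (●'◡'●) |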
I was wondering if there was a way to provide some text context to improve inference even further? For instance, the model sometimes struggles to understand some technical terms that I am saying when writing to a buffer. But if I had a way of providing that buffer as context, which might already contain those words or concepts relating to them, it might greatly improve that kind of situations. |
you can prepend context in whisper, not sure about this stream implementation's options though. |
I am a Whisper noob and furthermore not quite proficient at C++, but any pointers (pun intended) would be greatly appreciated |
@pachacamac wrote (a while ago)
and AFAICT this isn't the case yet so my basic advice is A cleaner way to do this could be to rely on To get the very last line that might match content |
Is the -vth parameter working as intended? Just compiling the example and testing with large-v3 model (GPU, step set to 700 works fine) I get "hallucinations" when not talking. With English language there's a constant "thank you" output and with Swedish I get absolutely hilarious and rather long sentences from the training data. Tried vth from 0.1 to 48000 without seeing any difference. |
You need to set |
I recently came upon this project from watching this video. It looks very good to me, perhaps the best Whisper "streaming" example I've seen. It seems to be using whisper_streaming under the hood. I wonder how hard it would be to implement their algorithm in whisper.cpp? |
Dear all, I am trying to use my Raspberry Pi 5 to display real-time speech-to-text in my Raspberry Pi terminal, and below are my steps: sudo apt install libsdl2-dev So far it works; however, why does an additional line, "whisper_mel_init: n_len = 3370, n_len_org = 369, n_mel = 80", always show up above my speech-to-text output? Does anyone know how to remove this annoying line? Many thanks. |
Placing |
I'm trying to hack some stuff together, and while doing that, I read this whole thread, which turned out to be very informative. I want to thank ggerganov and everyone else for their contributions here. Thanks all |
Is there any intention to get this working with output devices rather than just input devices? I have been playing around with this locally to try and force SDL2 to register a Windows output device. While I'm able to force this, Whisper doesn't seem to be able to recognise any audio playing through an output device. I'm not sure if this is an SDL2 limitation, or if it has to do with the AudioSpec configuration. I have configured the AudioSpec to be in stereo (2 channels) and matched the sample rate of my audio device (48000 Hz) in common-sdl.cpp, to no avail. Has anyone gotten this use case working, or is it out of scope for whisper-stream? SDL2 Version devel-2.30.11 |
AudioRelay has a plugin (I don't remember the name, you can just install this one) where you can Simulate an Output as Input. |
@weskerty Thanks for the heads up! However, I'm trying to do this without using virtual audio devices or external dependencies. I've been able to get this working with VB-Audio virtual audio devices, but it's not the most elegant solution. I'm going to continue digging into this when I have time, but any suggestions are welcome! |
Noting that the processing time is considerably shorter than the length of speech, is it possible to feed the models real time microphone output? Or does the inference run on the complete audio stream, instead of sample by sample?
This would greatly reduce the latency for voice assistants and the like, that the audio does not need to be fully captured and only after that fed to the models. Basically the same as I did here with SODA: https://github.com/biemster/gasr, but then with an open source and multilang model.