Support for realtime audio input #10

Closed
biemster opened this issue Oct 1, 2022 · 71 comments
Labels
enhancement New feature or request

Comments

@biemster

biemster commented Oct 1, 2022

Noting that the processing time is considerably shorter than the length of speech, is it possible to feed the models real time microphone output? Or does the inference run on the complete audio stream, instead of sample by sample?

This would greatly reduce latency for voice assistants and the like, since the audio would not need to be fully captured before being fed to the models. Basically the same as I did here with SODA: https://github.com/biemster/gasr, but with an open source, multilingual model.

@ggerganov ggerganov added the enhancement New feature or request label Oct 1, 2022
@ggerganov
Owner

The Whisper model processes the audio in chunks of 30 seconds - this is a hard constraint of the architecture.

However, what seems to work is you can take for example 5 seconds of audio and pad it with 25 seconds of silence. This way you can process shorter chunks.

Given that, an obvious strategy for realtime audio transcription is the following:

T  - [data]
-------------------
1  - [1 audio, 29 silence pad] -> transcribe -> "He"
2  - [2 audio, 28 silence pad] -> transcribe -> "Hello"
3  - [3 audio, 27 silence pad] -> transcribe -> "Hello, my"
...
29 - [29 audio, 1 silence pad] -> transcribe -> "Hello, my name is John ..."

The problem with that is you need to do the same amount of computation for 1 second audio as you would do for 2, 3, ... , 30 seconds of audio. So if your audio input step is 1 second (as shown in the example above), you will effectively do 30 times the computation that you would normally do to process the full 30 seconds.
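
The padding trick and its cost can be sketched in a few lines of Python. This is only an illustration of the idea described above, not code from whisper.cpp: the function names are made up for this example, and the 16 kHz sample rate is taken from the logs that appear later in this thread.

```python
SAMPLE_RATE = 16000  # Hz; assumption based on the sample rate shown in the logs in this thread
WINDOW_S = 30        # seconds; Whisper's fixed processing window


def pad_to_window(samples, sample_rate=SAMPLE_RATE, window_s=WINDOW_S):
    """Return a copy of `samples` padded with zeros (silence) up to one full window."""
    target = sample_rate * window_s
    if len(samples) > target:
        raise ValueError("audio is longer than one window")
    return list(samples) + [0.0] * (target - len(samples))


def passes_per_window(step_s, window_s=WINDOW_S):
    """Number of full-window inference passes needed to cover one window of audio."""
    return -(-window_s // step_s)  # ceiling division


five_seconds = [0.1] * (5 * SAMPLE_RATE)
padded = pad_to_window(five_seconds)
print(len(padded) / SAMPLE_RATE)  # 30.0
print(passes_per_window(1))       # 30
```

With a 1-second step the model runs 30 times per 30 seconds of speech, each run costing as much as a full window; a 3-second step brings that down to 10 runs.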

I plan to add a basic example of real-time audio transcription using the above strategy.

@biemster
Author

biemster commented Oct 1, 2022

I was a bit afraid that would be the answer, but I'll definitely check out that basic example when it's ready!

ggerganov added a commit that referenced this issue Oct 2, 2022
- Processes input in chunks of 3 seconds.
- Padding audio with silence
- Uses 1 second audio from previous pass
- No text context
@ggerganov
Owner

Just added a very naive implementation of the idea above. To run it, simply do:

# install sdl2 (Ubuntu)
$ sudo apt-get install libsdl2-dev

# install sdl2 (Mac OS)
$ brew install sdl2

# download a model if you don't have one
$ ./download-ggml-model.sh base.en

# run the real-time audio transcription
$ make stream
$ ./stream -m models/ggml-base.en.bin

This example continuously captures audio from the mic and runs whisper on the captured audio.
The time step is currently hardcoded at 3 seconds.

The results are not great because the current implementation can chop the audio in the middle of words.
Also, the text context is reset for every new iteration.

However, all these things can be significantly improved.
Probably we need to add some sort of simple VAD as a preprocessing step.
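
The chunking described in the commit note above (3-second steps, each pass reusing 1 second of audio from the previous pass) might look roughly like the following Python sketch. The function name and parameters are hypothetical and only illustrate the buffering; the actual stream example is written in C++ and may differ in detail.

```python
def overlapping_chunks(samples, step, keep):
    """Yield chunks of `step` new samples, each prefixed with the last `keep`
    samples of the previous pass so that words cut at a boundary get some context."""
    samples = list(samples)
    tail = []
    for start in range(0, len(samples), step):
        new = samples[start:start + step]
        yield tail + new
        # Remember the end of this pass so the next pass starts with context.
        tail = new[-keep:] if keep > 0 else []


# Tiny demonstration with integers standing in for audio samples.
for chunk in overlapping_chunks(range(10), step=4, keep=2):
    print(chunk)
```

For 16 kHz audio, the settings from the commit note would correspond to step = 3 * 16000 and keep = 1 * 16000 samples.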

@ggerganov
Owner

Here is a short video demonstration of near real-time transcription from the microphone:

rt_esl_csgo_1.mp4

@trholding
Contributor

Nice, but somehow I can't kill it (stream); I had to do a killall -9 stream. On an AMD 2-thread 3 GHz processor with 16 GB RAM, there is significant delay. However, I found that I get 2x realtime with the usual transcription of an audio file. Great work. I love this.

ggerganov added a commit that referenced this issue Oct 2, 2022
@ggerganov
Owner

Thanks for the feedback.
I just pushed a fix that should handle Ctrl+C correctly (it can take a few seconds to respond though).

Regarding the performance - I hope it can be improved with a better strategy to decide when to perform the inference. Currently, it is done every X seconds, regardless of the data. If we add voice detection, we should be able to run it less often. But overall, it seems that real-time transcription will always be slower compared to the original 30-seconds chunk transcription.

@trholding
Contributor

trholding commented Oct 2, 2022

Thanks for the quick fix. I have some suggestions/ideas, for faster voice transcription. Give me half an hour to one hour, I'll update here with new content.

Edit / Updated:

Here are some ideas to speed up offline non real time transcription:

Removing silence helps a lot in reducing total time of audio (not yet tried but obvious):

http://andrewslotnick.com/posts/speeding-up-a-speech.html#Remove-Silence

Things that I tried with good results:

First I ran a half-hour audio file through the https://github.com/xiph/rnnoise code. Then I increased the tempo to 1.5 with sox (tempo preserves pitch). After that I got good results with tiny.en, but base.en seemed to be less accurate. The overall process is much faster: really fast transcription except for the initial delay.

cd /tmp
./rnnoise_demo elon16.wav elon16.raw
sox -c 1 -r 16000 -b 16 --encoding signed-integer elon16.raw elon16_denoised.wav
sox elon16_denoised.wav elonT3.wav tempo 1.5
./main -m models/ggml-tiny.en.bin -f /tmp/elonT3.wav

Here are some ideas for faster real time transcription:

I noticed that when I ran this on a 5 sec clip, I got this result:

./main -m models/ggml-tiny.en.bin -f /tmp/rec.wav 
log_mel_spectrogram: recording length: 5.015500 s
...
main: processing 80248 samples (5.0 sec), 2 threads, lang = english, task = transcribe, timestamps = 1 ...

[00:00.000 --> 00:05.000]   Okay, this is a test. I think this will work out nicely.
[00:05.000 --> 00:10.000]   [no audio]
...
main:    total time = 18525.62 ms

Now if we could apply this:

1 VAD / silence detection (like you mentioned) to split the audio into chunks. The result is variable-length audio chunks in memory or in temp files
2 Remove noise with rnnoise on the chunks
3 Speed up each chunk by 1.5x while preserving pitch (the speed-up should just be an option. I learned that results above 1.5x are bad unless the voice is loud, clear and slow to start with; 1.5x is safe. Ideal is 1.1-1.5x, max 2x)
4 Since we know exactly how long the sped-up chunk is, we won't need to wait for the transcription to finish...

Example:
[00:00.000 --> 00:05.000] Okay, this is a test. I think this will work out nicely. <--- We could kill it right here (because this is the total length of the file / chunk I used as an example)
[00:05.000 --> 00:10.000] [no audio] <-- This is processing an empty buffer; killing it here would avoid the wasted processing
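
The stopping condition in step 4 can be illustrated with a small Python sketch that parses timestamped output lines in the format shown above and stops at the first segment that starts at or beyond the known clip length. The function name and the `slack_s` tolerance are assumptions made for this example; it only filters text that has already been produced, so the real saving would require stopping the inference itself at that point.

```python
import re

# Matches lines such as "[00:00.000 --> 00:05.000]   some text" (mm:ss.mmm timestamps).
SEGMENT = re.compile(r"\[(\d+):(\d+\.\d+) --> (\d+):(\d+\.\d+)\]\s*(.*)")


def segments_within(lines, clip_seconds, slack_s=0.1):
    """Return the text of segments that start before the end of the clip.

    `slack_s` is a hypothetical tolerance so that a segment starting at 5.000 s
    is treated as padding for a clip that is 5.0155 s long.
    """
    kept = []
    for line in lines:
        match = SEGMENT.match(line.strip())
        if not match:
            continue
        start = int(match.group(1)) * 60 + float(match.group(2))
        if start >= clip_seconds - slack_s:
            break  # everything from here on covers the silent padding
        kept.append(match.group(5))
    return kept


output = [
    "[00:00.000 --> 00:05.000]   Okay, this is a test. I think this will work out nicely.",
    "[00:05.000 --> 00:10.000]   [no audio]",
]
print(segments_within(output, clip_seconds=5.0155))
```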

VAD: https://github.com/cirosilvano/easyvad or maybe use webrtc vad?

I guess experimentation is needed to figure out the best strategy / approach to real time considering the 30 sec at once issue.

ggerganov added a commit that referenced this issue Oct 7, 2022
Controls how often we run the inference.
By default, we run it every 3 seconds.
ggerganov added a commit that referenced this issue Oct 7, 2022
Seems the results become worse when we keep the context, so by default
this is not enabled
@ggerganov
Owner

Some improvement on the real-time transcription:

rt_esl_csgo_2.mp4

@trholding
Contributor

I'll check this out and give you feedback here tomorrow. Awesome work! Brilliant.

@moebiussurfing

moebiussurfing commented Oct 11, 2022

hello @ggerganov , thanks for sharing!
Offline main mode tested here on Windows worked fine.

Any tips on how to include SDL so that the real-time app works?

@trholding
Contributor

On resource-constrained machines it doesn't seem to be better. The previous version worked for transcribing; this one is choking the CPU with no or only intermittent output. The same kill issue persists. I think it's because processes are spawned, which makes the system laggy.

@ggerganov I also caught a Floating point exception (core dumped) playing around with options : -t 2 --step 5000 --length 5000

./stream -m ./models/ggml-tiny.en.bin -t 2 --step 5000 --length 5000
audio_sdl_init: found 2 capture devices:
audio_sdl_init:    - Capture device #0: 'Built-in Audio'
audio_sdl_init:    - Capture device #1: 'Built-in Audio Analog Stereo'
audio_sdl_init: attempt to open default capture device ...
audio_sdl_init: obtained spec for input device (SDL Id = 2):
audio_sdl_init:     - sample rate:       16000
audio_sdl_init:     - format:            33056 (required: 33056)
audio_sdl_init:     - channels:          1 (required: 1)
audio_sdl_init:     - samples per frame: 1024
whisper_model_load: loading model from './models/ggml-tiny.en.bin'
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head  = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 384
whisper_model_load: n_text_head   = 6
whisper_model_load: n_text_layer  = 4
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 1
whisper_model_load: mem_required  = 244.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size =  84.99 MB
whisper_model_load: memory size =    11.41 MB 
whisper_model_load: model size  =    73.54 MB

main: processing 80000 samples (step = 5.0 sec / len = 5.0 sec), 2 threads, lang = en, task = transcribe, timestamps = 0 ...

Floating point exception (core dumped)
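
One possible explanation, offered here only as a hypothesis that the thread does not confirm: a log later in this thread prints "main: n_new_line = 9" for --step 500 --length 5000, which is consistent with n_new_line = length / step - 1 using integer division. If the example computes the value that way and then uses it as a divisor, --step 5000 --length 5000 would give 0, and an integer division or modulo by zero is reported on Linux as "Floating point exception". The Python sketch below only reproduces the arithmetic; the function names and the guard are illustrative, not the actual whisper.cpp code.

```python
def n_new_line(step_ms, length_ms):
    """Hypothetical reconstruction of the logged value: length / step - 1 (integer division)."""
    return length_ms // step_ms - 1


def guarded_n_new_line(step_ms, length_ms):
    """Same value, clamped to at least 1 so it is always safe to use as a divisor."""
    return max(1, n_new_line(step_ms, length_ms))


print(n_new_line(500, 5000))           # 9, matching the value logged later in the thread
print(n_new_line(5000, 5000))          # 0, which would be unsafe as a divisor
print(guarded_n_new_line(5000, 5000))  # 1
```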

@trholding
Contributor

But I think this not working on resource constrained devices should not be a blocker for you. If it works for everyone else, please feel free to close.

@tazz4843
Contributor

tazz4843 commented Oct 11, 2022

I think that floating point exception might be related to #39 as well, which was running on a 4 core AMD64 Linux server, not too resource constrained.

@ggerganov
Owner

The stream example should be updated to detect whether it is able to process the incoming audio stream in real time, and to provide a warning or error if it is not. Otherwise, it will behave in an undefined way.
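
A minimal sketch of such a check, in Python for illustration only: it tracks how far processing falls behind the audio capture, given the step length and the measured processing time of each pass. The function names and the idea of a lag budget are assumptions for this example, not how the stream tool implements its warning.

```python
def backlog_ms(step_ms, processing_ms_per_step):
    """Return the accumulated lag (ms) after each pass.

    A pass that takes longer than `step_ms` adds to the lag; a faster pass
    lets the processing catch up, but the lag never goes below zero.
    """
    lag = 0.0
    history = []
    for took in processing_ms_per_step:
        lag = max(0.0, lag + took - step_ms)
        history.append(lag)
    return history


def first_overrun(step_ms, processing_ms_per_step, max_lag_ms=0.0):
    """Index of the first pass whose lag exceeds `max_lag_ms`, or None if it keeps up."""
    for index, lag in enumerate(backlog_ms(step_ms, processing_ms_per_step)):
        if lag > max_lag_ms:
            return index
    return None


print(backlog_ms(500, [400, 700, 700, 300]))     # [0.0, 200.0, 400.0, 200.0]
print(first_overrun(500, [400, 700, 700, 300]))  # 1
```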

@pachacamac

pachacamac commented Oct 15, 2022

Also mentioning this here since it would be a super cool feature to have: Any way to register a callback or call a script once user speech is completed and silence/non-speak is detected? Been trying to hack on the CPP code but my CPP skills are rusty :(

@ggerganov
Owner

@pachacamac Will think about adding this option. Silence/non-speech detection is not trivial in general, but a simple thresholding approach that works in a quiet environment should not be too difficult to implement.
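
As a rough illustration of what a simple thresholding approach could look like, here is a Python sketch that reports the end of speech once loud frames have been followed by enough quiet frames. The frame length, threshold and function names are all assumptions chosen for this example; a real implementation would need tuning and would likely struggle in noisy environments, as noted above.

```python
import math


def rms(frame):
    """Root-mean-square energy of one frame of samples."""
    return math.sqrt(sum(x * x for x in frame) / len(frame)) if frame else 0.0


def speech_ended(samples, frame_len, threshold, min_silent_frames):
    """True once speech has been heard and is followed by enough silent frames."""
    heard_speech = False
    silent_run = 0
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        if rms(frame) >= threshold:
            heard_speech = True
            silent_run = 0  # speech resumed, restart the silence counter
        elif heard_speech:
            silent_run += 1
            if silent_run >= min_silent_frames:
                return True
    return False


loud = [0.5] * 160   # one 10 ms frame at 16 kHz with constant amplitude
quiet = [0.0] * 160
print(speech_ended(quiet * 3 + loud * 2 + quiet * 3, 160, 0.1, 3))  # True
```

A caller could invoke a callback or script at the point where this returns True.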

@RyanSelesnik

Hi, I am trying to run real-time transcription on the Raspberry Pi 4B, with a ReSpeaker Mic array. Is there any way to specify the audio input device when running ./stream?

@alexose

alexose commented Oct 29, 2022

@RyanSelesnik wrote: Hi, I am trying to run real-time transcription on the Raspberry Pi 4B, with a ReSpeaker Mic array. Is there any way to specify the audio input device when running ./stream?

I was able to specify the default input device through /etc/asound.conf:

pcm.!default {
  type asym
   playback.pcm {
     type plug
     slave.pcm "hw:0,0"
   }
   capture.pcm {
     type plug
     slave.pcm "hw:1,0"
   }
}

Curious if you have any luck getting real-time transcription to work on a Pi 4. Mine seems to run just a little too slow to give useful results, even with the tiny.en model.

@andres-ramirez-duque

Hi @alexose and @RyanSelesnik:

Have you had any success using the ReSpeaker 4 Mic Array (UAC1.0) to run the stream example on a Raspberry Pi?
My system configuration is:

  • Raspberry Pi 4 Model B Rev 1.4
  • Ubuntu 20.04.5 LTS (GNU/Linux 5.4.0-1073-raspi aarch64)

ubuntu@ubuntu:~/usrlib/whisper.cpp$ ./stream -m ./models/ggml-tiny.en.bin -t 8 --step 500 --length 5000 -c 0
audio_sdl_init: found 1 capture devices:
audio_sdl_init: - Capture device #0: 'ReSpeaker 4 Mic Array (UAC1.0), USB Audio'
audio_sdl_init: attempt to open capture device 0 : 'ReSpeaker 4 Mic Array (UAC1.0), USB Audio' ...
audio_sdl_init: obtained spec for input device (SDL Id = 2):
audio_sdl_init: - sample rate: 16000
audio_sdl_init: - format: 33056 (required: 33056)
audio_sdl_init: - channels: 1 (required: 1)
audio_sdl_init: - samples per frame: 1024
whisper_model_load: loading model from './models/ggml-tiny.en.bin'
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 384
whisper_model_load: n_text_head = 6
whisper_model_load: n_text_layer = 4
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 1
whisper_model_load: mem_required = 390.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size = 73.58 MB
whisper_model_load: memory size = 11.41 MB
whisper_model_load: model size = 73.54 MB

main: processing 8000 samples (step = 0.5 sec / len = 5.0 sec), 8 threads, lang = en, task = transcribe, timestamps = 0 ...
main: n_new_line = 9

[BLANK_AUDIO]

main: WARNING: cannot process audio fast enough, dropping audio ...


But running whisper on prerecorded audio from the same ReSpeaker device worked well:

sudo arecord -f S16_LE -d 10 -r 16000 --device="hw:1,0" /tmp/test-mic.wav
./main -m models/ggml-tiny.en.bin -f /tmp/test-mic.wav

Any suggestions for testing ./stream properly?
Cheers
AR

@alexose

alexose commented Nov 3, 2022

@andres-ramirez-duque I haven't had any luck in getting the streaming functionality to run fast enough. aarch64 with a Pi 4B 2GB. I've tried compiling with various flags (-Ofast) and trying various step length, thread count, etc.

I'm not good enough with C++ to know where to start optimizing, but I suspect the comment in PR #23 sheds some light on the issue:

On Arm platforms without __ARM_FEATURE_FP16_VECTOR_ARITHMETIC we convert to 32-bit floats. There might be a more efficient way, but this is good for now.

As well as the notes on optimization from @trholding and @ggerganov above.

@ggerganov
Owner

So the rule-of-thumb for using the stream example is to first run the bench tool using the model that you want to try. For example:

$ make bench
$ ./bench models/ggml-tiny.en

whisper_model_load: loading model from 'models/ggml-tiny.en.bin'
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head  = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 384
whisper_model_load: n_text_head   = 6
whisper_model_load: n_text_layer  = 4
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 1
whisper_model_load: mem_required  = 390.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size =  73.58 MB
whisper_model_load: memory size =    11.41 MB
whisper_model_load: model size  =    73.54 MB

system_info: n_threads = 4 / 10 | AVX2 = 0 | AVX512 = 0 | NEON = 1 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | 

whisper_print_timings:     load time =   103.94 ms
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time =   174.70 ms / 43.67 ms per layer
whisper_print_timings:   decode time =     0.00 ms / 0.00 ms per layer
whisper_print_timings:    total time =   278.77 ms

Note down the encode time. In this case, it is 174 ms.
Your --step parameter for the stream tool should be at least twice the encode time. So in this case: --step 350.

If the step is smaller, then it is very likely that the processing will be slower compared to the audio capture and it won't work.
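
The rule of thumb above is simple arithmetic; the Python helper below just makes it explicit. The function name and the rounding to a multiple of 50 ms are assumptions added for this example, chosen so that the 174.70 ms encode time from the benchmark maps to the suggested --step 350.

```python
import math


def min_step_ms(encode_ms, factor=2.0, round_to=50):
    """Smallest step (ms) that is at least `factor` times the encode time,
    rounded up to a multiple of `round_to` ms."""
    return math.ceil(encode_ms * factor / round_to) * round_to


print(min_step_ms(174.70))  # 350
```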

Overall, streaming on Raspberry Pis is a long shot. Maybe when they start supporting FP16 arithmetic (i.e. the ARMv8.2 instruction set) it could make sense.

@alexose

alexose commented Nov 3, 2022

Yes, makes sense.

For those of us trying to make this work on a cheap single-board computer, we'll probably want to use something like a Banana Pi BPI M5 (which is form-factor compatible with the Pi 4 but ships with a Cortex A55).

jacobwu-b pushed a commit to jacobwu-b/Transcriptify-by-whisper.cpp that referenced this issue Oct 24, 2023
@andupotorac

@trholding wrote (quoted in full above): suggestions for speeding up offline transcription (silence removal, rnnoise denoising, 1.5x tempo with sox) and real-time transcription (VAD-based chunking, denoising, optional speed-up, and stopping once the known chunk length has been transcribed).

Were you able to implement any of these ideas? Are there significant performance improvements?

@bestofman

Hi,
I just tried the following command:

make stream
./stream -m ./models/ggml-base.en.bin -t 8 --step 500 --length 5000

And it works, but the result I get is far behind the demo video. It just gets stuck with the first sentence and tries to update it instead of adding new sentences.

@aehlke

aehlke commented Nov 15, 2023

silero-vad looks best for VAD but I don't know how to port this to Swift yet - onnx and python notebook

edit: found a swift port, https://github.com/tangfuhao/Silero-VAD-for-iOS

@ArmanJR

ArmanJR commented Dec 3, 2023

After making my model using Core ML, I got this error while trying to build stream:

$ make stream
I whisper.cpp build info: 
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -pthread -DGGML_USE_ACCELERATE -DGGML_USE_METAL
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -pthread -DGGML_USE_METAL
I LDFLAGS:   -framework Accelerate -framework Foundation -framework Metal -framework MetalKit
I CC:       Apple clang version 15.0.0 (clang-1500.0.40.1)
I CXX:      Apple clang version 15.0.0 (clang-1500.0.40.1)

c++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -pthread -DGGML_USE_METAL examples/stream/stream.cpp examples/common.cpp examples/common-ggml.cpp examples/common-sdl.cpp ggml.o ggml-alloc.o ggml-backend.o ggml-quants.o whisper.o ggml-metal.o -o stream `sdl2-config --cflags --libs`  -framework Accelerate -framework Foundation -framework Metal -framework MetalKit
ld: Undefined symbols:
  _whisper_coreml_encode, referenced from:
      whisper_build_graph_conv(whisper_context&, whisper_state&, int) in whisper.o
  _whisper_coreml_free, referenced from:
      _whisper_free_state in whisper.o
  _whisper_coreml_init, referenced from:
      _whisper_init_state in whisper.o
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make: *** [stream] Error 1

@jpiabrantes

@ggerganov wrote: Some improvement on the real-time transcription: rt_esl_csgo_2.mp4

This is awesome! Would love to have a high-level description of the optimisations that were made.

@artshcherbina

artshcherbina commented Dec 16, 2023

Thank you for your great work!

I've added some simple logic to detect silence, and process only real voice input: #1649.

@thiemom

thiemom commented Dec 20, 2023

@ArmanJR wrote: After making my model using Core ML, I got this error while trying to build stream: (linker error quoted in full above: undefined symbols _whisper_coreml_encode, _whisper_coreml_free and _whisper_coreml_init)

Try compiling stream with WHISPER_COREML=1

@AimoneAndex

Why does adding the -l zh parameter only display Traditional Chinese, and why does adding the --prompt parameter report an invalid parameter? How can Simplified Chinese be displayed? Also, the accuracy of Chinese recognition is not very high and needs to be addressed. Thank you! Thank you very much!

@zixiai

zixiai commented Jan 18, 2024

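@AimoneAndex wrote: Why does adding the -l zh parameter only display Traditional Chinese, and why does adding the --prompt parameter report an invalid parameter? How can Simplified Chinese be displayed? Also, the accuracy of Chinese recognition is not very high and needs to be addressed. Thank you! Thank you very much!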

Add --prompt "简体输出" (the prompt text means "Simplified output").

@AimoneAndex

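When using the stream (real-time mode) script, adding --prompt reports that the parameter is invalid >︿< If needed, I'll post the original error when I have time so you can take a look (●'◡'●)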

kultivator-consulting pushed a commit to KultivatorConsulting/whisper.cpp that referenced this issue Feb 12, 2024
Link against C++ standard library and macOS Accelerate framework
kultivator-consulting pushed a commit to KultivatorConsulting/whisper.cpp that referenced this issue Feb 12, 2024
Removes things that were added in ggerganov#10 and adds note on Linux builds
@helins

helins commented Feb 15, 2024

stream is really awesome, even though it is labeled as merely "naive". It's not ready to be shared yet, but I have written a Neovim plugin that writes directly to the buffer, updating the line in real time. It feels nothing short of magical.

I was wondering if there is a way to provide some text context to improve inference even further? For instance, the model sometimes struggles to understand some of the technical terms I say when writing to a buffer. But if I had a way of providing that buffer as context, which might already contain those words or concepts relating to them, it might greatly improve those kinds of situations.

@aehlke

aehlke commented Feb 15, 2024

you can prepend context in whisper, not sure about this stream implementation's options though.

@helins

helins commented Feb 17, 2024

I am a Whisper noob and furthermore not quite proficient at C++, but any pointers (pun intended) would be greatly appreciated.

@Utopiah

Utopiah commented Mar 10, 2024

@pachacamac wrote (a while ago)

Any way to register a callback or call a script once user speech is completed and silence/non-speak is detected? Been trying to hack on the CPP code but my CPP skills are rusty

and AFAICT this isn't the case yet, so my basic advice is to run ./stream -f liveoutput to get the result in a text file, then watch cat liveoutput to show the result periodically (here every 2s).

A cleaner way to do this could be to rely on inotify to check whether the file has been modified and then act on that. Overall, a bit of logic has to be added on top, e.g. checking whether the detected sentence is new, or filtering out things like [typing sounds] or (keyboard clicking), but I imagine it's enough to get started without having to touch any C++ code.

To get the very last line that might contain speech: cat liveoutput | grep -v '(' | grep -v ']' | tail -1
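
For anyone who prefers a script over the shell pipeline, the same filtering can be sketched in Python. The function names are made up for this example, and the modification-time polling is a portable stand-in for inotify rather than an equivalent of it. Unlike the shell pipeline, this version also skips blank lines.

```python
import os


def last_speech_line(text):
    """Return the last non-blank line that contains neither '(' nor ']',
    mirroring: grep -v '(' | grep -v ']' | tail -1."""
    lines = [
        line for line in text.splitlines()
        if line.strip() and "(" not in line and "]" not in line
    ]
    return lines[-1] if lines else ""


def read_if_changed(path, last_mtime_ns):
    """Return (text, mtime_ns) if the file changed since `last_mtime_ns`,
    otherwise (None, last_mtime_ns). Call this in a loop to poll the file."""
    mtime_ns = os.stat(path).st_mtime_ns
    if mtime_ns == last_mtime_ns:
        return None, last_mtime_ns
    with open(path, encoding="utf-8", errors="replace") as handle:
        return handle.read(), mtime_ns


print(last_speech_line("hello there\n[typing sounds]\n(keyboard clicking)\n"))  # hello there
```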

@troed

troed commented Mar 13, 2024

Is the -vth parameter working as intended? Just compiling the example and testing with large-v3 model (GPU, step set to 700 works fine) I get "hallucinations" when not talking. With English language there's a constant "thank you" output and with Swedish I get absolutely hilarious and rather long sentences from the training data.

Tried vth from 0.1 to 48000 without seeing any difference.

@Toddinuk

When I run the command "./stream -m ./models/ggml-large-v3.bin -t 8 --step 500 --length 5000 -l zh", the parameter "-l zh" does not seem to function properly. If I speak English, it transcribes correctly, but when I speak Chinese, there is no response. Additionally, there are some unknown Chinese characters displayed.
SCR-20240314-swgp

@zaccheus

@Toddinuk wrote: When I run the command "./stream -m ./models/ggml-large-v3.bin -t 8 --step 500 --length 5000 -l zh", the parameter "-l zh" does not seem to function properly. If I speak English, it transcribes correctly, but when I speak Chinese, there is no response. Additionally, there are some unknown Chinese characters displayed.

I also encountered this problem

@pprobst
Contributor

pprobst commented Apr 8, 2024

@troed wrote: Is the -vth parameter working as intended? Just compiling the example and testing with large-v3 model (GPU, step set to 700 works fine) I get "hallucinations" when not talking. With English language there's a constant "thank you" output and with Swedish I get absolutely hilarious and rather long sentences from the training data. Tried vth from 0.1 to 48000 without seeing any difference.

You need to set --step 0. See: https://github.com/ggerganov/whisper.cpp/tree/master/examples/stream#sliding-window-mode-with-vad

@pprobst
Contributor

pprobst commented Apr 9, 2024

I recently came upon this project from watching this video. It looks very good to me, perhaps the best Whisper "streaming" example I've seen.

It seems to be using whisper_streaming under the hood. I wonder how hard it would be to implement their algorithm in whisper.cpp?

@olayno

olayno commented Jun 22, 2024

Dear all, I am trying to use my Raspberry Pi 5 to display real-time speech-to-text in my RP terminal. Below are my steps:

sudo apt install libsdl2-dev
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
make -j stream
./models/download-ggml-model.sh tiny

So far it displays the text. However, why does an additional line, "whisper_mel_init: n_len = 3370, n_len_org = 369, n_mel = 80", always show up above my speech-to-text output?

Does anyone know how to remove this annoying line?

Many thanks.

@weskerty

weskerty commented Jul 23, 2024

Placing 2>/dev/null at the end of your line will hide the messages and only show the result of the transcription.

@Kishlay-notabot

I'm trying to hack some stuff together, and while doing that I read this whole thread, which turned out to be very informative. I want to thank ggerganov and everyone for contributing here. Thanks, all.

@Soapwood

Soapwood commented Jan 4, 2025

Is there any intention to get this working with output devices rather than just input devices?

I have been playing around with this locally to try and force SDL2 to register a Windows output device. While I'm able to force this, Whisper doesn't seem to be able to recognise any audio playing through an output device.

image

I'm not sure if this is an SDL2 limitation, or if it has to do with the AudioSpec configuration. I have configured the AudioSpec to be stereo (2 channels) and matched the sample rate of my audio device (48000 Hz) in common-sdl.cpp, to no avail.

Has anyone gotten this use case working, or is it out of scope for whisper-stream?

SDL2 Version devel-2.30.11
Whisper v1.6.0

@weskerty

weskerty commented Jan 5, 2025

@Soapwood wrote: Is there any intention to get this working with output devices rather than just input devices?

AudioRelay has a plugin (I don't remember the name; you can just install this one) that lets you simulate an output as an input.
Although I think it doesn't recognize speech when there is music.

@Soapwood

Soapwood commented Jan 6, 2025

@weskerty Thanks for the heads up! However, I'm trying to do this without using virtual audio devices or external dependencies. I've been able to get this working with VB-Audio virtual audio devices, but it's not the most elegant solution.

I'm going to continue digging into this when I have time, but any suggestions are welcome!
