
how can I know when the audio start #24

Closed

yijinsheng opened this issue Nov 22, 2024 · 10 comments


@yijinsheng

I want to develop an LLM-based voice-to-voice app, so I need to know when users start to talk so I can interrupt the LLM and TTS output. But I can only see the ReplyOnPause function; what I need is a function that tells me when the user starts to talk.

@yijinsheng yijinsheng changed the title how can I know whern the audio start how can I know when the audio start Nov 22, 2024
@freddyaboulton
Owner

Hi @yijinsheng - you can do this by subclassing ReplyOnPause (a rough sketch follows the steps below).

Add a started_talking_event here: https://github.com/freddyaboulton/gradio-webrtc/blob/31baa205f52d449bf2e618f32acb25adba2431ed/backend/gradio_webrtc/reply_on_pause.py#L89

When the algorithm determines the user started talking, set the started_talking_event: https://github.com/freddyaboulton/gradio-webrtc/blob/31baa205f52d449bf2e618f32acb25adba2431ed/backend/gradio_webrtc/reply_on_pause.py#L124

In emit you can raise a StopIteration if the started_talking_event is set: https://github.com/freddyaboulton/gradio-webrtc/blob/31baa205f52d449bf2e618f32acb25adba2431ed/backend/gradio_webrtc/reply_on_pause.py#L193

Make sure that in reset, you clear the started_talking_event state.
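
Putting those four steps together, a minimal, untested sketch - the import path and method names are assumed from the reply_on_pause.py file linked above:

from threading import Event

from gradio_webrtc.reply_on_pause import ReplyOnPause


class ReplyOnPauseWithStart(ReplyOnPause):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Step 1: event that flips when speech is first detected
        self.started_talking_event = Event()

    def determine_pause(self, audio, sampling_rate, state) -> bool:
        pause = super().determine_pause(audio, sampling_rate, state)
        # Step 2: the base algorithm sets state.started_talking once
        # enough voiced audio has been seen
        if state.started_talking:
            self.started_talking_event.set()
        return pause

    def emit(self):
        # Step 3: cut off the in-flight reply as soon as speech starts
        if self.started_talking_event.is_set():
            raise StopIteration
        return super().emit()

    def reset(self):
        super().reset()
        # Step 4: clear the flag so the next turn starts fresh
        self.started_talking_event.clear()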

@freddyaboulton
Owner

Please let me know how it goes, and if you share your demo, I can add it to the cookbook: https://freddyaboulton.github.io/gradio-webrtc/cookbook/

@freddyaboulton
Owner

Were you able to figure this out @yijinsheng?

@duhtapioca

duhtapioca commented Dec 31, 2024

@freddyaboulton

I wrote the following class based on your inputs, and it seems to handle interruption detection and replying to the interruption on voice activity well. (Adapted from the ReplyOnPause class before the async-support changes.)

# Imports are assumed from gradio_webrtc's public API and its
# reply_on_pause module; exact paths may differ across versions.
import asyncio
import inspect
import logging
from threading import Event
from typing import Any, Literal, cast

import numpy as np

from gradio_webrtc import StreamHandler
from gradio_webrtc.reply_on_pause import (
    AlgoOptions,
    AppState,
    ReplyFnGenerator,
    SileroVadOptions,
    get_vad_model,
)

logger = logging.getLogger(__name__)


class ReplyOnPauseAndInterruption(StreamHandler):
    def __init__(
        self,
        fn: ReplyFnGenerator,
        algo_options: AlgoOptions | None = None,
        model_options: SileroVadOptions | None = None,
        expected_layout: Literal["mono", "stereo"] = "mono",
        output_sample_rate: int = 24000,
        output_frame_size: int = 480,
        input_sample_rate: int = 48000,
    ):
        super().__init__(
            expected_layout,
            output_sample_rate,
            output_frame_size,
            input_sample_rate=input_sample_rate,
        )
        self.expected_layout: Literal["mono", "stereo"] = expected_layout
        self.output_sample_rate = output_sample_rate
        self.output_frame_size = output_frame_size
        self.model = get_vad_model()
        self.fn = fn
        self.is_async = inspect.isasyncgenfunction(fn)
        self.event = Event()
        self.state = AppState()
        self.generator = None
        self.model_options = model_options
        self.algo_options = algo_options or AlgoOptions()
        self.latest_args: list[Any] = []
        self.args_set = Event()
        self.started_talking_event = Event()

    @property
    def _needs_additional_inputs(self) -> bool:
        return len(inspect.signature(self.fn).parameters) > 1

    def copy(self):
        return ReplyOnPauseAndInterruption(
            self.fn,
            self.algo_options,
            self.model_options,
            self.expected_layout,
            self.output_sample_rate,
            self.output_frame_size,
            self.input_sample_rate,
        )

    def determine_pause(
        self, audio: np.ndarray, sampling_rate: int, state: AppState
    ) -> bool:
        """Take in the stream, determine if a pause happened"""
        duration = len(audio) / sampling_rate

        if duration >= self.algo_options.audio_chunk_duration:
            dur_vad = self.model.vad((sampling_rate, audio), self.model_options)
            # logger.debug("VAD duration: %s", dur_vad)
            if (
                dur_vad > self.algo_options.started_talking_threshold
                and not state.started_talking
            ):
                state.started_talking = True
                self.started_talking_event.set()
                logger.debug("Started talking")

            if state.started_talking:
                if state.stream is None:
                    state.stream = audio
                else:
                    state.stream = np.concatenate((state.stream, audio))
            state.buffer = None
            if dur_vad < self.algo_options.speech_threshold and state.started_talking:
                return True
        return False

    def process_audio(self, audio: tuple[int, np.ndarray], state: AppState) -> None:
        frame_rate, array = audio
        array = np.squeeze(array)
        if not state.sampling_rate:
            state.sampling_rate = frame_rate
        if state.buffer is None:
            state.buffer = array
        else:
            state.buffer = np.concatenate((state.buffer, array))

        pause_detected = self.determine_pause(
            # pass the state argument through rather than self.state
            state.buffer, state.sampling_rate, state
        )
        state.pause_detected = pause_detected

    def receive(self, frame: tuple[int, np.ndarray]) -> None:
        self.process_audio(frame, self.state)
        if self.state.pause_detected:
            self.state.started_talking = False
            self.started_talking_event.clear()
            self.event.set()

    def reset(self):
        self.args_set.clear()
        if self.generator and self.state.responding:
            if self.is_async:
                logger.debug("Closing async generator")
                asyncio.run_coroutine_threadsafe(self.generator.aclose(), self.loop).result()
            else:
                logger.debug("Closing generator")
                self.generator.close()
        self.generator = None
        self.event.clear()
        self.state.buffer = None
        self.state.stream = None
        self.state.responding = False

    def set_args(self, args: list[Any]):
        super().set_args(args)
        self.args_set.set()

    async def fetch_args(
        self,
    ):
        if self.channel:
            self.channel.send("tick")
            logger.debug("Sent tick")

    async def async_iterate(self, generator) -> Any:
        return await anext(generator)

    def emit(self):
        if not self.event.is_set():
            return None
        else:
            if self.started_talking_event.is_set():
                self.reset()
                return None

            if not self.generator:
                if self._needs_additional_inputs and not self.args_set.is_set():
                    asyncio.run_coroutine_threadsafe(self.fetch_args(), self.loop)
                    self.args_set.wait()
                logger.debug("Creating generator")
                audio = cast(np.ndarray, self.state.stream).reshape(1, -1)
                if self._needs_additional_inputs:
                    self.latest_args[0] = (self.state.sampling_rate, audio)
                    self.generator = self.fn(*self.latest_args)
                else:
                    self.generator = self.fn((self.state.sampling_rate, audio))  # type: ignore
                
            self.state.responding = True
            try:
                logger.debug("Emitting audio")
                if self.is_async:
                    return asyncio.run_coroutine_threadsafe(
                        self.async_iterate(self.generator), self.loop
                    ).result()
                else:
                    return next(self.generator)
                    
            except (StopIteration, StopAsyncIteration):
                self.reset()
                self.state.responding = False
                return None

The only issue is that although the LLM function (ReplyFnGenerator) is called again instantly on interruption, the TTS audio stream takes 3-4 seconds to stop: it keeps emitting the audio chunks the ReplyFnGenerator yielded and queued before the interruption, which delays the stop. Ideally the audio stream should stop the moment the user starts talking.

Any advice or input on how to deal with this? My intuition is that the stream handler functionality needs to be modified for this, but I'm not sure. Please let me know.

Thanks!

@duhtapioca

duhtapioca commented Jan 6, 2025

@freddyaboulton, please let me know; any high-level advice would work too.

@freddyaboulton
Owner

Hi @duhtapioca ! Thanks for your patience, I was on holiday break.

Yes, I think you would need to clear the actual output audio queue when the started_talking_event is set.

There's no way to do that now. I think one thing we can do is store a reference to the output queue in StreamHandlerBase and then clear it on reset (here).

Can you see if that fixes the issue? Happy to merge a PR in if so.
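
In the meantime, a rough sketch of that idea on top of the class above; the output_queue attribute is hypothetical and stands in for wherever the library ends up storing the outgoing audio queue:

import asyncio

class QueueClearingHandler(ReplyOnPauseAndInterruption):
    def reset(self):
        super().reset()
        # Hypothetical attribute: assumes StreamHandlerBase keeps a
        # reference to the outgoing audio queue as self.output_queue.
        q = getattr(self, "output_queue", None)
        if q is not None:
            while True:
                try:
                    q.get_nowait()  # drop queued-but-unplayed chunks
                except asyncio.QueueEmpty:
                    break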

@albertofh98

albertofh98 commented Jan 28, 2025

Exactly. Following @freddyaboulton's latest response, you can easily empty the audio_callback queue. For example, I am developing a real-time OpenAI application using the WebRTC library. In my case, to stop the audio after an interruption, I simply need to do the following:

# In my case, I simply declared webrtc as a global variable
elif event.type == "input_audio_buffer.speech_started":
    k = list(webrtc.connections.keys())[0]
    audio_callback = webrtc.connections[k][0]

    # Drain the queued-but-unplayed chunks. Depending on your program,
    # you may want to catch QueueEmpty in case the queue is drained
    # concurrently.
    while not audio_callback.queue.empty():
        audio_callback.queue.get_nowait()
        audio_callback.queue.task_done()

Hope it helps!

@albertofh98

albertofh98 commented Feb 20, 2025

Hi again!

While the previous implementation for managing interruptions generally works, I've encountered an issue where, after some interruptions, the assistant remains silent for several seconds before responding with the new answer, even though the audio queue is cleared and new audio chunks are processed for the next response. Additionally, there are instances where the assistant does not respond at all after an interruption and plays no further audio.

What do you think could be causing this? @freddyaboulton

@mahimairaja

@albertofh98 Were you able to fix it?

@freddyaboulton
Owner

Hello! ReplyOnPause and ReplyOnStopWords can now be interrupted by default as of version 0.0.11 - no extra workarounds needed. You can disable this with the can_interrupt parameter.
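
For reference, a minimal usage sketch (component and parameter names as in the cookbook; assumes gradio_webrtc >= 0.0.11):

import gradio as gr
from gradio_webrtc import ReplyOnPause, WebRTC

def response(audio):
    # yield (sample_rate, numpy_array) audio chunks here
    ...

with gr.Blocks() as demo:
    audio = WebRTC(mode="send-receive", modality="audio")
    audio.stream(
        # can_interrupt=True is the default as of 0.0.11; incoming
        # speech cuts off a reply that is still playing
        fn=ReplyOnPause(response, can_interrupt=True),
        inputs=[audio],
        outputs=[audio],
    )

demo.launch()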
