
how can I know when the audio start #24

Closed

yijinsheng opened this issue Nov 22, 2024 · 10 comments


@yijinsheng

I want to develop an LLM-based voice-to-voice app, so I need to know when users start to talk so I can interrupt the LLM and TTS output. But I can only see the ReplyOnPause function; what I need is a function that tells me when the user starts to talk.

@yijinsheng yijinsheng changed the title how can I know whern the audio start how can I know when the audio start Nov 22, 2024
@freddyaboulton
Owner

Hi @yijinsheng - you can do this by subclassing ReplyOnPause (a rough sketch follows the steps below).

Add a started_talking_event here: https://github.com/freddyaboulton/gradio-webrtc/blob/31baa205f52d449bf2e618f32acb25adba2431ed/backend/gradio_webrtc/reply_on_pause.py#L89

When the algorithm determines the user started talking, set the started_talking_event: https://github.com/freddyaboulton/gradio-webrtc/blob/31baa205f52d449bf2e618f32acb25adba2431ed/backend/gradio_webrtc/reply_on_pause.py#L124

In emit you can raise a StopIteration if the started_talking_event is set: https://github.com/freddyaboulton/gradio-webrtc/blob/31baa205f52d449bf2e618f32acb25adba2431ed/backend/gradio_webrtc/reply_on_pause.py#L193

Make sure that in reset, you clear the started_talking_event state.
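
Putting those four steps together, a minimal, untested sketch - the import path and method names are assumed from the reply_on_pause.py file linked above:

from threading import Event

from gradio_webrtc.reply_on_pause import ReplyOnPause


class ReplyOnPauseWithStart(ReplyOnPause):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Step 1: event that flips when speech is first detected
        self.started_talking_event = Event()

    def determine_pause(self, audio, sampling_rate, state) -> bool:
        pause = super().determine_pause(audio, sampling_rate, state)
        # Step 2: the base algorithm sets state.started_talking once
        # enough voiced audio has been seen
        if state.started_talking:
            self.started_talking_event.set()
        return pause

    def emit(self):
        # Step 3: cut off the in-flight reply as soon as speech starts
        if self.started_talking_event.is_set():
            raise StopIteration
        return super().emit()

    def reset(self):
        super().reset()
        # Step 4: clear the flag so the next turn starts fresh
        self.started_talking_event.clear()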

@freddyaboulton
Owner

Please let me know how it goes, and if you share your demo, I can add it to the cookbook: https://freddyaboulton.github.io/gradio-webrtc/cookbook/

@freddyaboulton
Owner

Were you able to figure this out @yijinsheng?

@duhtapioca

duhtapioca commented Dec 31, 2024

@freddyaboulton

I wrote the following class based on your inputs, and it seems to handle interruption detection and replying to the interruption on voice activity well. (Adapted from the ReplyOnPause class before the async-support changes.)

# Imports are assumed from gradio_webrtc's public API and its
# reply_on_pause module; exact paths may differ across versions.
import asyncio
import inspect
import logging
from threading import Event
from typing import Any, Literal, cast

import numpy as np

from gradio_webrtc import StreamHandler
from gradio_webrtc.reply_on_pause import (
    AlgoOptions,
    AppState,
    ReplyFnGenerator,
    SileroVadOptions,
    get_vad_model,
)

logger = logging.getLogger(__name__)


class ReplyOnPauseAndInterruption(StreamHandler):
    def __init__(
        self,
        fn: ReplyFnGenerator,
        algo_options: AlgoOptions | None = None,
        model_options: SileroVadOptions | None = None,
        expected_layout: Literal["mono", "stereo"] = "mono",
        output_sample_rate: int = 24000,
        output_frame_size: int = 480,
        input_sample_rate: int = 48000,
    ):
        super().__init__(
            expected_layout,
            output_sample_rate,
            output_frame_size,
            input_sample_rate=input_sample_rate,
        )
        self.expected_layout: Literal["mono", "stereo"] = expected_layout
        self.output_sample_rate = output_sample_rate
        self.output_frame_size = output_frame_size
        self.model = get_vad_model()
        self.fn = fn
        self.is_async = inspect.isasyncgenfunction(fn)
        self.event = Event()
        self.state = AppState()
        self.generator = None
        self.model_options = model_options
        self.algo_options = algo_options or AlgoOptions()
        self.latest_args: list[Any] = []
        self.args_set = Event()
        self.started_talking_event = Event()

    @property
    def _needs_additional_inputs(self) -> bool:
        return len(inspect.signature(self.fn).parameters) > 1

    def copy(self):
        return ReplyOnPauseAndInterruption(
            self.fn,
            self.algo_options,
            self.model_options,
            self.expected_layout,
            self.output_sample_rate,
            self.output_frame_size,
            self.input_sample_rate,
        )

    def determine_pause(
        self, audio: np.ndarray, sampling_rate: int, state: AppState
    ) -> bool:
        """Take in the stream, determine if a pause happened"""
        duration = len(audio) / sampling_rate

        if duration >= self.algo_options.audio_chunk_duration:
            dur_vad = self.model.vad((sampling_rate, audio), self.model_options)
            # logger.debug("VAD duration: %s", dur_vad)
            if (
                dur_vad > self.algo_options.started_talking_threshold
                and not state.started_talking
            ):
                state.started_talking = True
                self.started_talking_event.set()
                logger.debug("Started talking")

            if state.started_talking:
                if state.stream is None:
                    state.stream = audio
                else:
                    state.stream = np.concatenate((state.stream, audio))
            state.buffer = None
            if dur_vad < self.algo_options.speech_threshold and state.started_talking:
                return True
        return False

    def process_audio(self, audio: tuple[int, np.ndarray], state: AppState) -> None:
        frame_rate, array = audio
        array = np.squeeze(array)
        if not state.sampling_rate:
            state.sampling_rate = frame_rate
        if state.buffer is None:
            state.buffer = array
        else:
            state.buffer = np.concatenate((state.buffer, array))

        pause_detected = self.determine_pause(
            # pass the state argument through rather than self.state
            state.buffer, state.sampling_rate, state
        )
        state.pause_detected = pause_detected

    def receive(self, frame: tuple[int, np.ndarray]) -> None:
        self.process_audio(frame, self.state)
        if self.state.pause_detected:
            self.state.started_talking = False
            self.started_talking_event.clear()
            self.event.set()

    def reset(self):
        self.args_set.clear()
        if self.generator and self.state.responding:
            if self.is_async:
                logger.debug("Closing async generator")
                asyncio.run_coroutine_threadsafe(self.generator.aclose(), self.loop).result()
            else:
                logger.debug("Closing generator")
                self.generator.close()
        self.generator = None
        self.event.clear()
        self.state.buffer = None
        self.state.stream = None
        self.state.responding = False

    def set_args(self, args: list[Any]):
        super().set_args(args)
        self.args_set.set()

    async def fetch_args(
        self,
    ):
        if self.channel:
            self.channel.send("tick")
            logger.debug("Sent tick")

    async def async_iterate(self, generator) -> Any:
        return await anext(generator)

    def emit(self):
        if not self.event.is_set():
            return None
        else:
            if self.started_talking_event.is_set():
                self.reset()
                return None

            if not self.generator:
                if self._needs_additional_inputs and not self.args_set.is_set():
                    asyncio.run_coroutine_threadsafe(self.fetch_args(), self.loop)
                    self.args_set.wait()
                logger.debug("Creating generator")
                audio = cast(np.ndarray, self.state.stream).reshape(1, -1)
                if self._needs_additional_inputs:
                    self.latest_args[0] = (self.state.sampling_rate, audio)
                    self.generator = self.fn(*self.latest_args)
                else:
                    self.generator = self.fn((self.state.sampling_rate, audio))  # type: ignore
                
            self.state.responding = True
            try:
                logger.debug("Emitting audio")
                if self.is_async:
                    return asyncio.run_coroutine_threadsafe(
                        self.async_iterate(self.generator), self.loop
                    ).result()
                else:
                    return next(self.generator)
                    
            except (StopIteration, StopAsyncIteration):
                self.reset()
                self.state.responding = False
                return None

The only issue is that although the LLM function (ReplyFnGenerator) is called again instantly on interruption, the TTS audio stream takes 3-4 seconds to stop: it keeps emitting the audio chunks the ReplyFnGenerator yielded and queued before the interruption, which delays the stop. Ideally the audio stream should stop the moment the user starts talking.

Any advice or input on how to deal with this? My intuition is that the stream handler functionality needs to be modified for this, but I'm not sure. Please let me know.

Thanks!

@duhtapioca

duhtapioca commented Jan 6, 2025

@freddyaboulton, please let me know; any high-level advice would work too.

@freddyaboulton
Owner

Hi @duhtapioca ! Thanks for your patience, I was on holiday break.

Yes, I think you would need to clear the actual output audio queue when the started_talking_event is set.

There's no way to do that now. I think one thing we can do is store a reference to the output queue in StreamHandlerBase and then clear it on reset (here).

Can you see if that fixes the issue? Happy to merge a PR in if so.
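
In the meantime, a rough sketch of that idea on top of the class above; the output_queue attribute is hypothetical and stands in for wherever the library ends up storing the outgoing audio queue:

import asyncio

class QueueClearingHandler(ReplyOnPauseAndInterruption):
    def reset(self):
        super().reset()
        # Hypothetical attribute: assumes StreamHandlerBase keeps a
        # reference to the outgoing audio queue as self.output_queue.
        q = getattr(self, "output_queue", None)
        if q is not None:
            while True:
                try:
                    q.get_nowait()  # drop queued-but-unplayed chunks
                except asyncio.QueueEmpty:
                    break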

@albertofh98

albertofh98 commented Jan 28, 2025

Exactly. Following @freddyaboulton's latest response, you can easily empty the audio_callback queue. For example, I am developing a real-time OpenAI application using the WebRTC library. In my case, to stop the audio after an interruption, I simply need to do the following:

# In my case, I simply declared webrtc as a global variable
elif event.type == "input_audio_buffer.speech_started":
    k = list(webrtc.connections.keys())[0]
    audio_callback = webrtc.connections[k][0]

    # Drain the queued-but-unplayed chunks. Depending on your program,
    # you may want to catch QueueEmpty in case the queue is drained
    # concurrently.
    while not audio_callback.queue.empty():
        audio_callback.queue.get_nowait()
        audio_callback.queue.task_done()

Hope it helps!

@albertofh98

albertofh98 commented Feb 20, 2025

Hi again!

While the previous implementation for managing interruptions generally works, I've encountered an issue where, after some interruptions, the assistant remains silent for several seconds before responding with the new answer, even though the audio queue is cleared and new audio chunks are processed for the next response. Additionally, there are instances where the assistant does not respond at all after an interruption and plays no further audio.

What do you think could be causing this? @freddyaboulton

@mahimairaja

@albertofh98 Were you able to fix it?

@freddyaboulton
Owner

Hello! ReplyOnPause and ReplyOnStopWords can now be interrupted by default as of version 0.0.11 - no extra workarounds needed. You can disable this with the can_interrupt parameter.
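
For reference, a minimal usage sketch (component and parameter names as in the cookbook; assumes gradio_webrtc >= 0.0.11):

import gradio as gr
from gradio_webrtc import ReplyOnPause, WebRTC

def response(audio):
    # yield (sample_rate, numpy_array) audio chunks here
    ...

with gr.Blocks() as demo:
    audio = WebRTC(mode="send-receive", modality="audio")
    audio.stream(
        # can_interrupt=True is the default as of 0.0.11; incoming
        # speech cuts off a reply that is still playing
        fn=ReplyOnPause(response, can_interrupt=True),
        inputs=[audio],
        outputs=[audio],
    )

demo.launch()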
