[Feedback] Voice Calls (alpha) #175

Open
enricoros opened this issue Oct 24, 2023 · 8 comments
@enricoros
Owner

Instructions and feedback thread for Voice Calls in big-AGI.

1. Start a Voice call

Note: it's best to start a call from an existing chat, so that both ends (the AI Persona you call, and yourself)
have as much context as possible.

There are two ways of initiating a Voice Call from an existing chat:

  1. "Call" button at the bottom (right on desktop, left on mobile)
  2. "Call" button on the persona selector


2. System Check

Make sure all the checks are green, or try to resolve the issues before proceeding. This wizard will only be shown
the first time, unless the issues persist.

3. Call Options

During a call, you can switch "Push To Talk" on or off. When it is on (the default), you must press the microphone
button before speaking. This helps avoid echoes and other ambient noise.

Note - you can also say the following commands during a call. These single words will be interpreted as system commands:

  • goodbye: ends the call
  • retry: regenerates the last message
  • restart: starts from the beginning
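
As a rough illustration of how single-word commands like these could be told apart from normal speech, here is a minimal sketch. It assumes a command only counts when it is the whole utterance; `VoiceCommand` and `parseVoiceCommand` are hypothetical names for this example, not big-AGI's actual code.

```typescript
// Hypothetical names for illustration only; not taken from the big-AGI codebase.
type VoiceCommand = 'goodbye' | 'retry' | 'restart';

const VOICE_COMMANDS: readonly VoiceCommand[] = ['goodbye', 'retry', 'restart'];

/**
 * Returns the command when the whole utterance is exactly one command word
 * (ignoring case, surrounding whitespace and punctuation), otherwise null.
 */
function parseVoiceCommand(transcript: string): VoiceCommand | null {
  const normalized = transcript
    .trim()
    .toLowerCase()
    .replace(/^[^a-z]+|[^a-z]+$/g, ''); // drop leading/trailing punctuation, e.g. "Retry."
  return VOICE_COMMANDS.find((command) => command === normalized) ?? null;
}
```

With this approach, "Goodbye." ends the call, while a sentence such as "please retry that" is passed to the persona as regular speech.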

Known limitations:

  • Works best with headphones: there is no echo suppression, so the AI Voice itself may be recorded and loop
  • At the end of a call, it is not summarized or appended to the chat history (just yet)
  • The call can run out of tokens, in which case the persona will read the error message out loud
  • Changing the voice will also restart the chat

🙌
Looking forward to your feedback to prioritize the right integration and development!
🙌

@enricoros enricoros added the RFC label Oct 24, 2023
@enricoros enricoros self-assigned this Oct 24, 2023
@DeFiFoFum
Contributor

DeFiFoFum commented Oct 25, 2023

  1. does it work, or what are the issues?
    1.1 🟢 Voice-to-text and text-to-voice seem to work really well. I tried a few voices and I think they sound great.
    1.2 🔴 If the AI has a long speech response, it doesn't seem that there is a way to interrupt it.
    1.2.1 I asked a new question during a previous response and it kept going
    1.2.2 The AI was still speaking and I clicked the back arrow to go back to the chat and it was still speaking with the call window closed

  2. how to make it better - what would you improve?
    2.1 Being able to interrupt (1.2)
    2.2 Making it more hands free. The nice part about a call is that you can be hands free.
    2.2.1 Maybe once the AI stops speaking it starts listening again. Or it is always listening during the speech, but it only responds if you say, "excuse me" or something.
    2.3 Be able to see the call conversation in the chat window.

  3. is it useful at all? - how would you add some WOW-Factor
    3.1 It's a good start to conversational AI imho, but I will need to be able to do more with the call AI to make it more useful for me. Things like:
    3.1.1 "Please look up the news for XYZ and tell me about it"
    3.1.2 "Please make an outline of our conversation and add it to the text chat window"
    3.1.3 Hands free would be so great.
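
The interruption request in 1.2 and 2.1 could be sketched roughly as follows. This is only an illustration under assumptions: `SpeechPlayer` is a hypothetical stand-in for whatever actually plays the audio, and `InterruptibleSpeech` is not an existing big-AGI class.

```typescript
// Hypothetical interface standing in for the real audio playback layer.
interface SpeechPlayer {
  play(text: string): void;
  stop(): void;
}

class InterruptibleSpeech {
  private readonly player: SpeechPlayer;
  private speaking = false;

  constructor(player: SpeechPlayer) {
    this.player = player;
  }

  /** Starts speaking; any response still playing is stopped first. */
  speak(text: string): void {
    this.interrupt();
    this.speaking = true;
    this.player.play(text);
  }

  /** Call when the user starts a new question or closes the call window. */
  interrupt(): boolean {
    if (!this.speaking) return false;
    this.player.stop();
    this.speaking = false;
    return true;
  }

  /** Call from the player's "playback ended" notification. */
  onPlaybackEnded(): void {
    this.speaking = false;
  }

  get isSpeaking(): boolean {
    return this.speaking;
  }
}
```

Wiring `interrupt()` to both the microphone button and the back arrow would cover the two cases reported in 1.2.1 and 1.2.2.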

@vaibhavard

Suggestions:

Suno Bark: an open-source alternative to the ElevenLabs API for speech synthesis.

  • It is free and should sound much more realistic, enhancing that WOW factor.
  • Check out an example here - https://huggingface.co/spaces/suno/bark
  • Bark can add sound effects to the call, such as laughs. GPT can be prompted to express emotions.

Info about the free and open source speech synthesis model Bark:
Bark is a universal text-to-audio model created by Suno, with code publicly available here. Bark can generate highly realistic, multilingual speech as well as other audio - including music, background noise and simple sound effects. This demo should be used for research purposes only. Commercial use is strictly prohibited. The model output is not censored and the authors do not endorse the opinions in the generated content. Use at your own risk.

@jontybrook
Contributor

does it work, or what are the issues?

  • I'm on macOS (M1). It works well in Chrome, but for some reason it doesn't work in Arc Browser (Chromium based). I'm getting Error occurred during speech recognition: network when clicking the push to talk button. No other obvious logs or failed HTTP requests in the console.
  • using a speaker with the 'push to talk' switch off (continuous listening) results in the AI hearing itself and looping / thanking itself endlessly (lol). A solution to this could be to automatically stop listening when text-to-speech is streaming, and resume again when it ends. This would
  • I got a random 405 error Failed to load resource: the server responded with a status of 405 () during one of my conversations. There was no UI feedback to indicate an error had occurred.
  • i'm seeing some layout shift when using the voice selection drop-down.
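
The "stop listening while the AI is speaking, resume when it ends" idea from the second bullet could look roughly like the sketch below. It is a minimal illustration, not the project's implementation: `Microphone` and `HalfDuplexGate` are hypothetical names, and the two `onSpeech…` methods are assumed to be called by the playback layer.

```typescript
// Hypothetical interface standing in for the app's real speech recognition controls.
interface Microphone {
  start(): void;
  stop(): void;
}

class HalfDuplexGate {
  private readonly mic: Microphone;
  private listening = false;
  private aiSpeaking = false;
  private continuous = false;

  constructor(mic: Microphone) {
    this.mic = mic;
  }

  /** Mirrors the 'push to talk' switch being off (continuous listening). */
  setContinuousListening(enabled: boolean): void {
    this.continuous = enabled;
    this.sync();
  }

  /** Call when the AI voice starts playing. */
  onSpeechStart(): void {
    this.aiSpeaking = true;
    this.sync();
  }

  /** Call when the AI voice has finished playing. */
  onSpeechEnd(): void {
    this.aiSpeaking = false;
    this.sync();
  }

  // Listen only when continuous mode is on and the AI is silent.
  private sync(): void {
    const shouldListen = this.continuous && !this.aiSpeaking;
    if (shouldListen === this.listening) return;
    this.listening = shouldListen;
    if (shouldListen) this.mic.start();
    else this.mic.stop();
  }
}
```

The trade-off is that the user cannot interrupt the AI by voice while it is speaking, since the microphone is off during playback.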

how to make it better - what would you improve?

  • remove the 'call timer': I get that the design is mimicking a phone call.. but this isn't a phone call and I feel that having the timer there is over-complicating it a bit. It gives me a sort-of anxiety that i'm using up tokens or something (I realise it's not.. but many users may think that every second costs them money!). I also want to keep the window open on my screen so I can engage with it at any time during the day to ask questions. Having the timer there makes this feel wrong; like I need to give it my full attention or something.
  • more prominence to the chat transcript: I feel the transcript is a more important UI element than the persona icon / avatar; so I would prefer the avatar to shrink when resizing the window so that the transcript always has eg 30vh to scroll within.
  • don't show the streaming text response until the speech audio starts playing. show an animated loading spinner in place of the chat bubble until the speech starts to play. this would make

is it useful at all? - how would you add some WOW-Factor

  • to add wow-factor, I would add some animated UI, such as a pulsating mic icon when listening. The dream would be a dynamic animation which wobbles as the AI speech audio plays (like GPT voice on mobile) but i'm not sure this would be achievable in the same way on web.

squeeze your brain for more ideas

  • conversational error handling: if an error occurs during the conversation, you could play a pre-cached voice message saying something like 'I seem to be having some technical difficulty right now. Please try again.', or even a specific message depending on the error, eg 'you seem to have lost connection to the internet'.
  • I often use big-agi to ask AI models coding questions. Currently, this doesn't work well in a voice interface, as the voice tries to read out the code, which isn't fun to listen to nor very useful. What could make this use case much better would be to strip out any code blocks from the textual model response before passing it to the TTS. This would result in the voice reading any explanation parts of the response but not reading the actual code. Even better, you could replace code blocks with '(See the code on your screen)' in the voice script.
  • Here's my harsh, critical feedback on this: it's pretty useful and a good experience, but as a 'power user' I would prefer something less 'have a call with the AI' and more 'voice mode in big-agi'. Does this make sense? In other words, if the primary goal of speech functionality in the app is to make interacting with AI models as quick and frictionless as possible; I feel the 'phone call' UX analogy actually adds friction. Receiving a call from the AI is cool, but ultimately it's a gimmick. Let's say i'm in a chat with an AI and I CBA to type. What I really want is to be able to click a button and get speech in, speech out with one button; alongside the text chat UI.
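
The code-block suggestion above could be sketched as a small text filter applied before speech synthesis. This is a minimal sketch assuming the model response uses Markdown-style fenced code blocks; `prepareTextForSpeech` is a hypothetical helper name, not an existing big-AGI function.

```typescript
// Hypothetical helper, not part of big-AGI: prepares a model response for text-to-speech.
const FENCE = '`'.repeat(3); // built at runtime so this example contains no literal fence
const CODE_PLACEHOLDER = '(See the code on your screen)';

// Matches a fenced block; an unterminated fence (e.g. mid-stream) is matched to the end of the text.
const FENCED_BLOCK = new RegExp(FENCE + '[\\s\\S]*?(?:' + FENCE + '|$)', 'g');

function prepareTextForSpeech(markdown: string): string {
  return markdown
    .replace(FENCED_BLOCK, '\n' + CODE_PLACEHOLDER + '\n')
    .replace(/\n{3,}/g, '\n\n') // collapse the blank lines left behind
    .trim();
}
```

The on-screen transcript would still show the original response, so only the spoken script changes.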

That's my feedback; hope it's useful. Keep up the good work with big-agi. I use it every day!

enricoros added a commit that referenced this issue Nov 10, 2023
See also #175. This accomplishes a similar function in an elegant way.
@enricoros enricoros assigned enricoros and unassigned enricoros Nov 16, 2023
@enricoros enricoros added this to the vNext milestone Nov 19, 2023
@enricoros enricoros modified the milestones: 1.6.0, 1.7.0 Nov 28, 2023
@enricoros enricoros moved this from In Progress to Committed in big-AGI build-in-public roadmap Nov 28, 2023
@enricoros enricoros removed this from the 1.7.0 milestone Nov 28, 2023
@enricoros enricoros removed the RFC label Jan 9, 2024
@enricoros enricoros moved this from Committed to In Progress in big-AGI build-in-public roadmap Jan 17, 2024
@agus4402

No voice features work in Brave.

@enricoros
Owner Author

No voice features work in Brave.

Yes, sadly Brave does not support the Web Speech API for voice input.
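
For reference, a page can check up front whether the browser exposes the Web Speech API recognition constructor, which is available as `SpeechRecognition` or, in Chromium-based browsers, as the prefixed `webkitSpeechRecognition`. The sketch below is an illustration only: `hasSpeechRecognition` and `SpeechRecognitionHost` are hypothetical names, and the presence of the constructor does not guarantee that recognition works (see the `network` error reported above for Arc).

```typescript
// The host is passed in (rather than reading `window` directly) so the check is easy to test.
type SpeechRecognitionHost = {
  SpeechRecognition?: unknown;
  webkitSpeechRecognition?: unknown;
};

/** True when the host exposes a (possibly prefixed) SpeechRecognition constructor. */
function hasSpeechRecognition(host: SpeechRecognitionHost): boolean {
  const ctor = host.SpeechRecognition ?? host.webkitSpeechRecognition;
  return typeof ctor === 'function';
}
```

In a browser this would be called as `hasSpeechRecognition(window as SpeechRecognitionHost)`; a `false` result is a good moment to show a "voice input is not supported in this browser" message instead of failing silently.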

@enricoros enricoros moved this from In Progress to Committed in big-AGI build-in-public roadmap Feb 4, 2024
@gitwittidbit

Having issues in Firefox (on Mac). While I activated speech recognition in the browser settings, it does not seem to work: I talk but I get no reaction from the AI.

@githubbozo77

githubbozo77 commented May 3, 2024

I think the voice is a great feature, and I've really been looking for something like this, but it would be best if it really worked like a phone call. Right now it keeps chiming when "listening", which is kind of annoying and disruptive to the conversation, especially in hands-free mode (as opposed to push to talk). I've seen this done in other browser-based chats where it's more of a stream listening to the microphone.

To get rid of the sound looping, where the AI hears itself speaking and responds to itself, I've seen implementations that shut off the microphone while the computer is speaking, until the sound has stopped playing. In the case I'm talking about, voxta.ai, the microphone icon goes red with a slash to show that it's not listening while the AI is speaking. This stops the sound looping, so it even works without headphones. The implementation on voxta.ai works smoothly: you can go back and forth like with Pi AI or the ChatGPT conversation mode. It's really cool. If you are wearing headphones, there is even a mode in the settings that lets you interrupt it: it listens all the time, even while it's speaking, so you can interrupt the AI mid-sentence. That mode is meant to be used with headphones.

If you could get the conversation mode to work more like either of those, this would be the killer app: you get to pick the LLM you want, you get to customize things, and you can have a seamless two-way conversation with just about any LLM there is, especially with all the choices on something like openrouter.ai. It would be very, very cool to have smooth conversations with just about any LLM out there using your software. It'd really get to be like the movie Her. Great job on this software!

One other thing: as it's implemented now, a "call" didn't consistently play the speech responses. It was hit or miss: sometimes it would speak what the AI was saying back, and other times it wouldn't. It always displayed the response, but every other time it didn't speak it.

@dagelf

dagelf commented Jun 25, 2024

This is very nifty, and almost anyone can set it up (as long as they use Google Chrome on Desktop)

But... any niceness gets erased when you have a great or funny conversation that's almost impossible to repeat, that you want to screenshot or record... and then you resize the window only to run into this:

Changing the voice will also restart the chat

Yes, resizing the window too. Really?!

At the end of a call, it is not summarized or appended to the chat history (just yet)

Come on!!! What the hell?
...... Sigh. Okay now you've forced me to take a look for myself to see why this simple functionality is so hard that it's not here yet.... 😞

Labels
None yet
Projects
Status: Committed
Development

No branches or pull requests

8 participants