[BUG🐛] Streaming output not working #38
Comments
The stream works sentence by sentence: if you insert longer inputs it will still yield one sentence at a time. If you used tokens directly from the GPT generator, the output would have much lower quality. Edit: by profiling the vocalization step on your quote you can see that the text reaches the synthesis part in 3.2 s and is vocalized in 100 ms, so you should further optimize vLLM to reduce TTFT to actually be faster in streaming.
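In code, the behavior described above looks roughly like this (a conceptual sketch with stand-in functions, not Auralis's actual classes):

```python
import re
import time

def gpt_generate(sentence: str) -> list[int]:
    # Stand-in for the vLLM GPT pass; per the profile above, this step
    # dominates latency (~3.2 s before synthesis even starts).
    time.sleep(0.01)
    return list(range(len(sentence)))

def vocoder(tokens: list[int]) -> bytes:
    # Stand-in for vocoding, which only takes ~100 ms per sentence above.
    return bytes(len(tokens))

def sentence_stream(text: str):
    # The stream yields one *whole sentence* at a time, which is why a
    # short input appears to arrive all at once.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        yield vocoder(gpt_generate(sentence))
```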
To reduce TTFB further you could use `prepare_for_streaming_generation`.
@mlinmg how do you recommend using `prepare_for_streaming_generation`? Is there a way to also change the chunk size? Currently TTFB is around 3 s for me, but even then streaming gives the whole two sentences at once; it's not streaming chunk by chunk. Thank you :)
Yeah, by calling that method you pre-calculate the speaker embeddings and GPT conditioning, so they can be reused for the next chunk. We avoided doing actual streaming because we observed that the quality of the speech was very poor. You could easily modify the classes to implement it: first shift the vLLM output to delta type, then yield the result every x tokens to obtain real streaming. However, as I said, the output quality will probably be much worse than the regular model.
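A hedged sketch of that modification (the loop below uses the public vLLM `AsyncLLMEngine` pattern rather than Auralis's internal classes, and assumes conditioning has already been pre-computed with `prepare_for_streaming_generation`):

```python
import uuid
from vllm import SamplingParams
from vllm.engine.async_llm_engine import AsyncLLMEngine

async def stream_token_chunks(engine: AsyncLLMEngine, prompt: str,
                              chunk_size: int = 8):
    """Yield *new* ("delta") token ids every `chunk_size` tokens instead of
    waiting for a complete sentence; each delta can be vocoded immediately."""
    params = SamplingParams(max_tokens=512)
    seen = 0  # number of tokens already emitted
    async for output in engine.generate(prompt, params, str(uuid.uuid4())):
        token_ids = output.outputs[0].token_ids  # cumulative ids so far
        if len(token_ids) - seen >= chunk_size or output.finished:
            delta = list(token_ids[seen:])
            seen = len(token_ids)
            yield delta  # hand this small chunk to the vocoder right away
```

As the thread notes, vocoding such short chunks skips sentence-level context, which is why the output quality is expected to degrade.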
Bug Description
I'm sending a line of text to the TTS with `stream=True` in `TTSRequest`. From my understanding of other TTS APIs, the output should start arriving as soon as the model begins generating audio, chunk by chunk. Instead, I'm getting the audio for the whole line at once, and it takes around ~3 seconds to generate audio for a single sentence. I'm using an L4 GPU with 24 GB VRAM. This inflates the first-byte latency, which is crucial for making the TTS work in a low-latency voice pipeline.

Please note, we have an internal XTTS model serving framework that gives us a first-byte latency of ~300 ms, and we didn't do nearly as much optimization as you are doing. So I was expecting a lower, or at least similar, first-byte latency from Auralis.

Minimal Reproducible Example
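A sketch of the setup described above (the loader call and model ids follow the Auralis README; treating `generate_speech` as a chunk generator when `stream=True` is an assumption):

```python
import time
from auralis import TTS, TTSRequest

# Model ids as in the Auralis README; exact loader arguments may differ.
tts = TTS().from_pretrained("AstraMindAI/xttsv2",
                            gpt_model="AstraMindAI/xtts2-gpt")

request = TTSRequest(
    text="This is a single test sentence to vocalize.",
    speaker_files=["reference.wav"],  # hypothetical reference clip
    stream=True,
)

start = time.perf_counter()
for i, chunk in enumerate(tts.generate_speech(request)):
    # Expected: several chunks arriving incrementally.
    # Observed: one chunk for the whole sentence after ~3 s.
    print(f"chunk {i} after {time.perf_counter() - start:.2f}s")
```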
Expected Behavior
Receiving chunks of the generated audio as soon as they are generated.
Actual Behavior
The generated audio for a sentence arrives all at once, after a significant delay.
Error Logs
Environment
Please run the following commands and include the output: