server : POC OAI-compat TTS using OuteTTS #11070
base: master
Conversation
@ggerganov Not sure if you will want to merge this into master, or if you prefer to keep it as a demo (as the TTS support is very much a POC for now). Feel free to discuss and push commits directly to this PR if needed!
An OAI endpoint would be quite useful.
@ngxson Very cool! This approach avoids the need to get the codes from one server (the LLM) and send them to another server (the WavTokenizer). I think before we merge this, we have to make a PoC for streamed TTS generation and see how it would fit into this implementation. I am not sure yet how to implement streaming - I think it requires some sort of chunking of the codes and "stitching" the spectrograms. But it's something that might require running the decoder and the vocoder in parallel, and this would probably need 2 server instances to achieve.
I think it can still be achieved using a single server instance. The main thing that I don't really understand for now is whether the vocoder decoding stage (converting codes to a spectrogram) can be chunked (for now, the whole thing must be run in a single batch). If it is possible to do so, then we can chunk N codes each time and send each chunk to the vocoder.
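The "chunk N codes each time" idea above could be sketched roughly like this (a minimal illustration only; `chunk_codes` and the batch size are hypothetical names, not part of this PR):

```python
def chunk_codes(codes, n):
    """Split a flat list of vocoder codes into batches of at most n codes,
    preserving order, so each batch can be decoded separately."""
    return [codes[i:i + n] for i in range(0, len(codes), n)]

# e.g. chunk_codes(list(range(10)), 4) yields three batches: 4 + 4 + 2 codes
```

Each batch would then be decoded independently; the open question discussed below is how to hide the boundaries between the resulting audio segments.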
On first look, it seems that it requires the full set of codes because the attention of the vocoder is non-causal. But my guess is that regardless of that, it can be chunked. There will likely be "edge effects" at the chunk boundaries that have to be smoothed out in some way, otherwise there will be noise artifacts in the final audio.
Another idea could be having overlapped codes and "stitching" them together (as you said above). In Adobe there's an effect called "constant power" that can do that in a near perfect manner, but I'm not (yet) sure how it translates into math / C++ code.
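For reference, a "constant power" (equal-power) crossfade translates into math quite directly: the two gain curves are the cosine and sine of a quarter period, so the squared gains always sum to 1 and the perceived loudness stays constant across the fade. A minimal sketch (the helper name is hypothetical, not from this PR):

```python
import math

def equal_power_crossfade(tail, head):
    """Mix the tail of chunk A into the head of chunk B with equal-power gains.

    tail, head: overlapping sample lists of the same length.
    At position t in [0, 1]: gain_a = cos(t * pi/2), gain_b = sin(t * pi/2),
    so gain_a**2 + gain_b**2 == 1 at every point of the fade.
    """
    n = len(tail)
    out = []
    for i in range(n):
        t = i / max(n - 1, 1)
        ga = math.cos(t * math.pi / 2)  # fades 1 -> 0
        gb = math.sin(t * math.pi / 2)  # fades 0 -> 1
        out.append(ga * tail[i] + gb * head[i])
    return out
```

The equal-power curve avoids the audible dip in loudness that a plain linear crossfade produces when the two signals are uncorrelated.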
I managed to make another POC for streaming with the idea above. Each chunk contains a single word. The output sound has the "edge effect" as you mentioned. Overlapping chunks may help, but I'm not sure if it will negatively impact the performance. Here is the link to the diff: https://github.com/ggerganov/llama.cpp/compare/master...ngxson:llama.cpp:xsn/server_tts_streamed?expand=1
The simplest stitching that can be done is like this. Let's say that your implementation chunks the codes at the word boundaries like this:

```
# current
codes: 000000000000000000
                         11111111
                                 222222
                                       333333333333 ...
audio: aaaaaaaaaaaaaaaaaa
                         bbbbbbbb
                                 cccccc
                                       dddddddddddd ...
final: aaaaaaaaaaaaaaaaaabbbbbbbbccccccdddddddddddd ...
```

Instead, we can chunk on groups of 2 words and mix the resulting audios at the edges with an appropriate interpolation function:

```
# stitched
codes: 00000000000000000011111111
                         11111111222222
                                 222222333333333333 ...
audio: aaaaaaaaaaaaaaaaaab‾\_
                         __/‾bbbbc‾\_
                                 __/‾ccddddddddd‾\_ ...
final: aaaaaaaaaaaaaaaaaaBBBBbbbbCCCCccdddddddddDDD ...
# (capital letters indicate mixed audio)
```

This should fix the noise artifacts and should not increase the processing cost dramatically. Right now, the FFT code is very slow, but this can become multiple times faster with a proper implementation, so it's not something to worry about.
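The stitching in the diagram above can be sketched as follows (a simplified illustration using a linear ramp over the overlap; an equal-power curve could be substituted, and the function name is hypothetical):

```python
def stitch(chunks, overlap):
    """Concatenate decoded audio chunks, crossfading `overlap` samples at each
    boundary: the tail of the accumulated audio is mixed with the head of the
    next chunk using a linear 0 -> 1 ramp (the "capital letter" regions)."""
    out = list(chunks[0])
    for ch in chunks[1:]:
        n = min(overlap, len(out), len(ch))
        for i in range(n):
            w = (i + 1) / (n + 1)  # weight of the incoming chunk, 0 -> 1
            j = len(out) - n + i
            out[j] = (1 - w) * out[j] + w * ch[i]
        out.extend(ch[n:])  # remainder of the new chunk, unmixed
    return out
```

Only the overlapping region is mixed, so the added cost is proportional to the overlap length, not to the total audio, which matches the observation that this should not increase the processing cost dramatically.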
This is a POC that provides an OAI-compatible `/v1/audio/speech` endpoint in the server, using the OuteTTS model.

To use it, start the server (the `--path` argument is optional, only needed if you want to use the TTS web UI), then access the UI via http://127.0.0.1:8080/
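Once the server is running, the endpoint can be exercised with a request like the following. This is a hypothetical example in the shape of the OpenAI `audio/speech` API; which request fields (e.g. `model`, `voice`) this POC actually honors beyond `input` depends on the implementation:

```shell
# Assumes the server from this PR is listening on 127.0.0.1:8080.
curl http://127.0.0.1:8080/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Hello, this is a test of the TTS endpoint."
  }' \
  --output speech.wav
```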
TODO in the future: