
server : POC OAI-compat TTS using OuteTTS #11070

Open · wants to merge 5 commits into master
Conversation

@ngxson (Collaborator) commented Jan 3, 2025

This is a POC that provides an OAI-compatible /v1/audio/speech endpoint in the server, using the OuteTTS model.

To use it (the --path argument is optional, only needed if you want to use the TTS web UI):

./build/bin/llama-server --tts-oute-default -c 4096 --path ./examples/server/public_tts

Then access the UI via http://127.0.0.1:8080/
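For reference, an API call to the new endpoint could look like this (a minimal sketch assuming the same request shape as OpenAI's /v1/audio/speech; the exact fields honored by this POC may differ, and the input text is just an example):

```sh
curl http://127.0.0.1:8080/v1/audio/speech \
    -H "Content-Type: application/json" \
    -d '{"input": "Hello from llama.cpp!"}' \
    --output speech.wav
```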

this_is_a_poc_that_provides_open.mp4

TODO in the future:

  • Add other export file formats, e.g. mp3, ogg
  • Expose more controls over the generation (e.g. speed, other voices, ...)
  • Add tests (currently we have no way to test audio files)

@ngxson marked this pull request as ready for review January 4, 2025 11:37
@ngxson requested a review from ggerganov January 4, 2025 11:37
@ngxson (Collaborator, Author) commented Jan 4, 2025

@ggerganov Not sure whether you'll want to merge this into master, or if you prefer to keep it as a demo (as the TTS is very much a POC for now). Feel free to discuss and push commits directly to this PR if needed!

@sorasoras
An OAI endpoint would be quite useful.

@ggerganov (Owner) commented Jan 6, 2025

@ngxson Very cool! This approach avoids the need to get the codes from one server (the LLM) and send them to another server (the WavTokenizer).

I think before we merge this, we have to make a PoC for streamed TTS generation and see how it would fit into this implementation. I am not sure yet how to implement streaming - I think it requires some sort of chunking of the codes and "stitching" of the spectrograms. It might also require running the decoder and the vocoder in parallel, which would probably need 2 server instances.

@ngxson (Collaborator, Author) commented Jan 6, 2025

I think it requires some sort of chunking of the codes and "stitching" of the spectrograms

I think it can still be achieved using a single server instance. The main thing I don't really understand yet is whether the vocoder decoding stage (converting codes to a spectrogram) can be chunked (for now, the whole thing must be run in a single batch inside tts_get_embd).

If that is possible, then we can chunk N codes at a time, send each chunk to SERVER_TASK_TYPE_TTS_EMBD (i.e. tts_get_embd), and then stream that output (the corresponding chunk of the spectrogram) to the response.
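As an illustration, that loop could look roughly like the following sketch. Only SERVER_TASK_TYPE_TTS_EMBD / tts_get_embd come from this PR; send_tts_embd_task and send_audio_chunk are hypothetical stand-ins for the server's task queue and HTTP response streaming:

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical helper: would post a SERVER_TASK_TYPE_TTS_EMBD task
// (i.e. run tts_get_embd) on this chunk only and wait for the result.
static std::vector<float> send_tts_embd_task(const std::vector<int32_t> & codes) {
    return std::vector<float>(codes.size());
}

// Hypothetical helper: would write this part of the audio to the HTTP response.
static void send_audio_chunk(const std::vector<float> & embd) {
    printf("streamed chunk of %zu values\n", embd.size());
}

// Run the vocoder on N codes at a time instead of one big batch,
// streaming each partial result as soon as it is ready.
void stream_tts(const std::vector<int32_t> & codes) {
    const size_t N = 64; // codes per chunk (tunable)
    for (size_t i = 0; i < codes.size(); i += N) {
        const size_t n = std::min(N, codes.size() - i);
        const std::vector<int32_t> chunk(codes.begin() + i, codes.begin() + i + n);
        send_audio_chunk(send_tts_embd_task(chunk));
    }
}
```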

@ggerganov (Owner) commented Jan 6, 2025

The main thing I don't really understand yet is whether the vocoder decoding stage (converting codes to a spectrogram) can be chunked (for now, the whole thing must be run in a single batch inside tts_get_embd).

On first look, it seems that it requires the full set of codes because the attention of the vocoder is non-causal. But my guess is that regardless of that, it can be chunked. There will likely be "edge effects" at the chunk boundaries that have to be smoothed out in some way, otherwise there will be noise artifacts in the final audio.

@ngxson (Collaborator, Author) commented Jan 6, 2025

Another idea could be having overlapping codes and "stitching" them together (as you said above). In Adobe's tools there's an effect called "constant power" that can do this in a near-perfect manner, but I'm not (yet) sure how it translates into math / C++ code.
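For what it's worth, the usual "constant power" crossfade is just a pair of cos/sin gain curves chosen so that the summed power stays constant (g_out^2 + g_in^2 = 1). A minimal C++ sketch, assuming the two chunks overlap over n samples:

```cpp
#include <cmath>
#include <vector>

// Equal-power ("constant power") crossfade: fade chunk A out with cos and
// chunk B in with sin, so g_out^2 + g_in^2 == 1 at every sample and the
// summed power stays constant across the seam.
std::vector<float> crossfade(const std::vector<float> & a_tail,
                             const std::vector<float> & b_head) {
    const size_t n = a_tail.size(); // must equal b_head.size()
    std::vector<float> out(n);
    for (size_t i = 0; i < n; ++i) {
        const float t     = (i + 0.5f) / n;                    // 0 -> 1 across the overlap
        const float g_out = std::cos(t * (float) M_PI / 2.0f); // gain of the ending chunk
        const float g_in  = std::sin(t * (float) M_PI / 2.0f); // gain of the starting chunk
        out[i] = g_out * a_tail[i] + g_in * b_head[i];
    }
    return out;
}
```

Compared to a plain linear ramp, the cos/sin pair avoids the audible dip in loudness in the middle of the overlap when the two signals are uncorrelated.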

@ngxson (Collaborator, Author) commented Jan 6, 2025

I managed to make another POC for streaming with the idea above. Each chunk contains a single word, i.e. the codes from <|code_start|> to <|code_end|>. Currently, only one of the 2 models (text-to-code or vocoder) can run at a given time, but I think we can spin up a new thread with a queue for the vocoder in the future, so the 2 models can run in parallel.
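A minimal sketch of that thread-plus-queue setup (illustrative names only; run_vocoder stands in for the actual vocoder call and audio streaming):

```cpp
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical stand-in: run the vocoder on one word's codes and stream the audio.
static void run_vocoder(const std::vector<int32_t> & codes) { (void) codes; }

int main() {
    std::queue<std::vector<int32_t>> jobs; // words produced by the text-to-code model
    std::mutex mtx;
    std::condition_variable cv;
    bool done = false;

    // The vocoder runs on its own thread, consuming words from the queue
    // while the text-to-code model keeps sampling on the main thread.
    std::thread vocoder([&] {
        for (;;) {
            std::vector<int32_t> codes;
            {
                std::unique_lock<std::mutex> lock(mtx);
                cv.wait(lock, [&] { return done || !jobs.empty(); });
                if (done && jobs.empty()) {
                    return;
                }
                codes = std::move(jobs.front());
                jobs.pop();
            }
            run_vocoder(codes);
        }
    });

    // ... per generated word, the producer would do:
    // { std::lock_guard<std::mutex> lock(mtx); jobs.push(word_codes); }
    // cv.notify_one();

    {
        std::lock_guard<std::mutex> lock(mtx);
        done = true;
    }
    cv.notify_one();
    vocoder.join();
    return 0;
}
```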

The output sound has the "edge effect" you mentioned. Overlapping chunks may help, but I'm not sure whether that will negatively impact performance.

i_managed_to_make_another_poc_fo.mp4

Here is the link to the diff: https://github.com/ggerganov/llama.cpp/compare/master...ngxson:llama.cpp:xsn/server_tts_streamed?expand=1

@ggerganov (Owner)

The simplest stitching that can be done is like this:

Let's say that your implementation chunks the codes at the word boundaries like this:

# current
codes: 000000000000000000
                         11111111
                                 222222
                                       333333333333 ...
audio: aaaaaaaaaaaaaaaaaa
                         bbbbbbbb
                                 cccccc
                                       dddddddddddd ...
final: aaaaaaaaaaaaaaaaaabbbbbbbbccccccdddddddddddd ...

Instead, we can chunk in groups of 2 words and mix the resulting audio at the edges with an appropriate interpolation function:

# stitched
codes: 00000000000000000011111111
                         11111111222222
                                 222222333333333333 ...   
audio: aaaaaaaaaaaaaaaaaab‾\_
                         __/‾bbbbc‾\_
                                 __/‾ccddddddddd‾\_ ...
final: aaaaaaaaaaaaaaaaaaBBBBbbbbCCCCccdddddddddDDD ...

# (capital letters indicate mixed audio)

This should fix the noise artifacts and should not increase the processing cost dramatically. Right now, the FFT code is very slow, but it can be made multiple times faster with a proper implementation, so it's not something to worry about.
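For illustration, a sketch of that stitching step, assuming each decoded chunk overlaps the previous one by a fixed number of samples and using a linear ramp for the mix (the constant-power curve from earlier in the thread would slot in the same way):

```cpp
#include <cstddef>
#include <vector>

// Concatenate decoded chunks whose audio overlaps by `overlap` samples,
// mixing the duplicated region (the capital letters in the diagram above)
// with a linear ramp. Assumes every chunk is longer than `overlap`.
std::vector<float> stitch(const std::vector<std::vector<float>> & chunks, size_t overlap) {
    std::vector<float> out;
    for (size_t c = 0; c < chunks.size(); ++c) {
        const std::vector<float> & audio = chunks[c];
        size_t start = 0;
        if (c > 0) {
            // blend the head of this chunk into the tail of the audio so far
            for (size_t i = 0; i < overlap; ++i) {
                const float t = (i + 0.5f) / overlap; // 0 -> 1 across the overlap
                float & dst = out[out.size() - overlap + i];
                dst = (1.0f - t) * dst + t * audio[i];
            }
            start = overlap;
        }
        out.insert(out.end(), audio.begin() + start, audio.end());
    }
    return out;
}
```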
