
server : POC OAI-compat TTS using OuteTTS #11070

Open · wants to merge 5 commits into master
Conversation

@ngxson (Collaborator) commented Jan 3, 2025

This is a POC that provides an OAI-compatible /v1/audio/speech endpoint in the server, using the OuteTTS model.

To use it (the --path argument is optional, only needed if you want to use the TTS web UI):

./build/bin/llama-server --tts-oute-default -c 4096 --path ./examples/server/public_tts

Then access the UI via http://127.0.0.1:8080/
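For reference, an API call to the new endpoint could look like this (a minimal sketch assuming the same request shape as OpenAI's /v1/audio/speech; the exact fields honored by this POC may differ, and the input text is just an example):

```sh
curl http://127.0.0.1:8080/v1/audio/speech \
    -H "Content-Type: application/json" \
    -d '{"input": "Hello from llama.cpp!"}' \
    --output speech.wav
```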

this_is_a_poc_that_provides_open.mp4

TODO in the future:

  • Add other export file formats, e.g. mp3, ogg
  • Expose more controls over the generation (e.g. speed, other voices, ...)
  • Add tests (currently we have no way to test audio files)

@ngxson marked this pull request as ready for review January 4, 2025 11:37
@ngxson requested a review from ggerganov January 4, 2025 11:37
@ngxson (Collaborator, Author) commented Jan 4, 2025

@ggerganov Not sure whether you'll want to merge this into master, or if you prefer to keep it as a demo (as the TTS is very much a POC for now). Feel free to discuss and push commits directly to this PR if needed!

@sorasoras
An OAI endpoint would be quite useful.

@ggerganov (Owner) commented Jan 6, 2025

@ngxson Very cool! This approach avoids the need to get the codes from one server (the LLM) and send them to another server (the WavTokenizer).

I think before we merge this, we have to make a PoC for streamed TTS generation and see how it would fit into this implementation. I am not sure yet how to implement streaming - I think it requires some sort of chunking of the codes and "stitching" of the spectrograms. It might also require running the decoder and the vocoder in parallel, which would probably need 2 server instances.

@ngxson (Collaborator, Author) commented Jan 6, 2025

I think it requires some sort of chunking of the codes and "stitching" of the spectrograms

I think it can still be achieved using a single server instance. The main thing I don't really understand yet is whether the vocoder decoding stage (converting codes to a spectrogram) can be chunked (for now, the whole thing must be run in a single batch inside tts_get_embd).

If that is possible, then we can chunk N codes at a time, send each chunk to SERVER_TASK_TYPE_TTS_EMBD (i.e. tts_get_embd), and then stream that output (the corresponding chunk of the spectrogram) to the response.
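As an illustration, that loop could look roughly like the following sketch. Only SERVER_TASK_TYPE_TTS_EMBD / tts_get_embd come from this PR; send_tts_embd_task and send_audio_chunk are hypothetical stand-ins for the server's task queue and HTTP response streaming:

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical helper: would post a SERVER_TASK_TYPE_TTS_EMBD task
// (i.e. run tts_get_embd) on this chunk only and wait for the result.
static std::vector<float> send_tts_embd_task(const std::vector<int32_t> & codes) {
    return std::vector<float>(codes.size());
}

// Hypothetical helper: would write this part of the audio to the HTTP response.
static void send_audio_chunk(const std::vector<float> & embd) {
    printf("streamed chunk of %zu values\n", embd.size());
}

// Run the vocoder on N codes at a time instead of one big batch,
// streaming each partial result as soon as it is ready.
void stream_tts(const std::vector<int32_t> & codes) {
    const size_t N = 64; // codes per chunk (tunable)
    for (size_t i = 0; i < codes.size(); i += N) {
        const size_t n = std::min(N, codes.size() - i);
        const std::vector<int32_t> chunk(codes.begin() + i, codes.begin() + i + n);
        send_audio_chunk(send_tts_embd_task(chunk));
    }
}
```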

@ggerganov (Owner) commented Jan 6, 2025

The main thing I don't really understand yet is whether the vocoder decoding stage (converting codes to a spectrogram) can be chunked (for now, the whole thing must be run in a single batch inside tts_get_embd).

On first look, it seems that it requires the full set of codes because the attention of the vocoder is non-causal. But my guess is that regardless of that, it can be chunked. There will likely be "edge effects" at the chunk boundaries that have to be smoothed out in some way, otherwise there will be noise artifacts in the final audio.

@ngxson (Collaborator, Author) commented Jan 6, 2025

Another idea could be having overlapping codes and "stitching" them together (as you said above). In Adobe's tools there's an effect called "constant power" that can do this in a near-perfect manner, but I'm not (yet) sure how it translates into math / C++ code.
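For what it's worth, the usual "constant power" crossfade is just a pair of cos/sin gain curves chosen so that the summed power stays constant (g_out^2 + g_in^2 = 1). A minimal C++ sketch, assuming the two chunks overlap over n samples:

```cpp
#include <cmath>
#include <vector>

// Equal-power ("constant power") crossfade: fade chunk A out with cos and
// chunk B in with sin, so g_out^2 + g_in^2 == 1 at every sample and the
// summed power stays constant across the seam.
std::vector<float> crossfade(const std::vector<float> & a_tail,
                             const std::vector<float> & b_head) {
    const size_t n = a_tail.size(); // must equal b_head.size()
    std::vector<float> out(n);
    for (size_t i = 0; i < n; ++i) {
        const float t     = (i + 0.5f) / n;                    // 0 -> 1 across the overlap
        const float g_out = std::cos(t * (float) M_PI / 2.0f); // gain of the ending chunk
        const float g_in  = std::sin(t * (float) M_PI / 2.0f); // gain of the starting chunk
        out[i] = g_out * a_tail[i] + g_in * b_head[i];
    }
    return out;
}
```

Compared to a plain linear ramp, the cos/sin pair avoids the audible dip in loudness in the middle of the overlap when the two signals are uncorrelated.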

@ngxson (Collaborator, Author) commented Jan 6, 2025

I managed to make another POC for streaming with the idea above. Each chunk contains a single word, i.e. the codes from <|code_start|> to <|code_end|>. Currently, only one of the 2 models (text-to-code or vocoder) can run at a given time, but I think we can spin up a new thread with a queue for the vocoder in the future, so the 2 models can run in parallel.
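A minimal sketch of that thread-plus-queue setup (illustrative names only; run_vocoder stands in for the actual vocoder call and audio streaming):

```cpp
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical stand-in: run the vocoder on one word's codes and stream the audio.
static void run_vocoder(const std::vector<int32_t> & codes) { (void) codes; }

int main() {
    std::queue<std::vector<int32_t>> jobs; // words produced by the text-to-code model
    std::mutex mtx;
    std::condition_variable cv;
    bool done = false;

    // The vocoder runs on its own thread, consuming words from the queue
    // while the text-to-code model keeps sampling on the main thread.
    std::thread vocoder([&] {
        for (;;) {
            std::vector<int32_t> codes;
            {
                std::unique_lock<std::mutex> lock(mtx);
                cv.wait(lock, [&] { return done || !jobs.empty(); });
                if (done && jobs.empty()) {
                    return;
                }
                codes = std::move(jobs.front());
                jobs.pop();
            }
            run_vocoder(codes);
        }
    });

    // ... per generated word, the producer would do:
    // { std::lock_guard<std::mutex> lock(mtx); jobs.push(word_codes); }
    // cv.notify_one();

    {
        std::lock_guard<std::mutex> lock(mtx);
        done = true;
    }
    cv.notify_one();
    vocoder.join();
    return 0;
}
```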

The output sound has the "edge effect" you mentioned. Overlapping chunks may help, but I'm not sure whether that will negatively impact performance.

i_managed_to_make_another_poc_fo.mp4

Here is the link to the diff: https://github.com/ggerganov/llama.cpp/compare/master...ngxson:llama.cpp:xsn/server_tts_streamed?expand=1

@ggerganov (Owner)

The simplest stitching that can be done is like this:

Let's say that your implementation chunks the codes at the word boundaries like this:

# current
codes: 000000000000000000
                         11111111
                                 222222
                                       333333333333 ...
audio: aaaaaaaaaaaaaaaaaa
                         bbbbbbbb
                                 cccccc
                                       dddddddddddd ...
final: aaaaaaaaaaaaaaaaaabbbbbbbbccccccdddddddddddd ...

Instead, we can chunk in groups of 2 words and mix the resulting audio at the edges with an appropriate interpolation function:

# stitched
codes: 00000000000000000011111111
                         11111111222222
                                 222222333333333333 ...   
audio: aaaaaaaaaaaaaaaaaab‾\_
                         __/‾bbbbc‾\_
                                 __/‾ccddddddddd‾\_ ...
final: aaaaaaaaaaaaaaaaaaBBBBbbbbCCCCccdddddddddDDD ...

# (capital letters indicate mixed audio)

This should fix the noise artifacts and should not increase the processing cost dramatically. Right now, the FFT code is very slow, but it can be made multiple times faster with a proper implementation, so it's not something to worry about.
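For illustration, a sketch of that stitching step, assuming each decoded chunk overlaps the previous one by a fixed number of samples and using a linear ramp for the mix (the constant-power curve from earlier in the thread would slot in the same way):

```cpp
#include <cstddef>
#include <vector>

// Concatenate decoded chunks whose audio overlaps by `overlap` samples,
// mixing the duplicated region (the capital letters in the diagram above)
// with a linear ramp. Assumes every chunk is longer than `overlap`.
std::vector<float> stitch(const std::vector<std::vector<float>> & chunks, size_t overlap) {
    std::vector<float> out;
    for (size_t c = 0; c < chunks.size(); ++c) {
        const std::vector<float> & audio = chunks[c];
        size_t start = 0;
        if (c > 0) {
            // blend the head of this chunk into the tail of the audio so far
            for (size_t i = 0; i < overlap; ++i) {
                const float t = (i + 0.5f) / overlap; // 0 -> 1 across the overlap
                float & dst = out[out.size() - overlap + i];
                dst = (1.0f - t) * dst + t * audio[i];
            }
            start = overlap;
        }
        out.insert(out.end(), audio.begin() + start, audio.end());
    }
    return out;
}
```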
