[BUG🐛] Streaming output not working #38

Open
tareqalmuntasir7 opened this issue Dec 13, 2024 · 4 comments
Labels
enhancement New feature or request

Comments

@tareqalmuntasir7

Bug Description

I'm sending a line of text to the TTS with stream=True in TTSRequest. From my understanding of other TTS APIs, the output should arrive as soon as the model starts generating audio, chunk by chunk. But I'm getting the audio for the whole line at once, and it takes ~3 seconds to generate audio for a single sentence. I'm using an L4 GPU with 24 GB VRAM.

This increases the first-byte latency, which is crucial to keep low for the TTS to work in a low-latency voice pipeline.

Please note, we have an internal XTTS model-serving framework that gives us a first-byte latency of ~300 ms, and we didn't do nearly as much optimization as you are doing. So I was expecting a lower, or at least similar, first-byte latency from Auralis.

Minimal Reproducible Example

import asyncio
import time

from auralis import TTS, TTSRequest

tts = TTS().from_pretrained("AstraMindAI/xttsv2", gpt_model='AstraMindAI/xtts2-gpt')

text = """The ancient mountains of the Andes are home to spectacled bears, pumas and the magnificent Andean condor."""

request = TTSRequest(
    text=text,
    speaker_files=["/home/common/Auralis/tests/resources/audio_samples/female.wav"],
    stream=True,
)

async def main():
    stream = await tts.generate_speech_async(request=request)

    is_first = True
    start_time = time.perf_counter()

    async for chunk in stream:
        if is_first:
            # Measure time to first chunk after the stream is opened.
            is_first = False
            end_time = time.perf_counter()
            print(f"Time taken: {end_time - start_time} seconds to generate first chunk")
        print(chunk)

asyncio.run(main())

Expected Behavior

Chunks of the generated audio are received as soon as they are generated.

Actual Behavior

The generated audio for the sentence arrives all at once, after a significant delay.

Error Logs

08:29:34.455 | XTTSv2.py:55 | ℹ️ INFO     | Initializing XTTSv2Engine...
08:29:35.766 | XTTSv2.py:196 | ℹ️ INFO     | Initializing VLLM engine with args: AsyncEngineArgs(model='AstraMindAI/xtts2-gpt', served_model_name=None, tokenizer='AstraMindAI/xtts2-gpt', task='auto', skip_tokenizer_init=False, tokenizer_mode='auto', chat_template_text_format='string', trust_remote_code=True, allowed_local_media_path='', download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, seed=0, max_model_len=1047, worker_use_ray=False, distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=True, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.07513329973591869, max_num_batched_tokens=10470, max_num_seqs=10, max_logprobs=20, disable_log_stats=True, revision=None, code_revision=None, rope_scaling=None, rope_theta=None, hf_overrides=None, tokenizer_revision=None, quantization=None, enforce_eager=True, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt={'audio': 1}, mm_processor_kwargs=None, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, fully_sharded_loras=False, lora_extra_vocab_size=256, long_lora_scaling_factors=None, lora_dtype='auto', max_cpu_loras=None, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, ray_workers_use_nsight=False, num_gpu_blocks_override=None, num_lookahead_slots=0, model_loader_extra_config=None, ignore_patterns=None, preemption_mode=None, scheduler_delay_factor=0.0, enable_chunked_prefill=None, guided_decoding_backend='outlines', speculative_model=None, speculative_model_quantization=None, speculative_draft_tensor_parallel_size=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, qlora_adapter_name_or_path=None, disable_logprobs_during_spec_decoding=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, disable_log_requests=False)
08:29:38.334 | logger.py:65 | ℹ️ INFO     | Downcasting torch.float32 to torch.float16.
08:29:38.340 | logger.py:65 | ⚠️ WARNING  | To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
08:29:38.344 | logger.py:65 | ℹ️ INFO     | Initializing an LLM engine (v0.6.4.post1) with config: model='AstraMindAI/xtts2-gpt', speculative_config=None, tokenizer='AstraMindAI/xtts2-gpt', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=1047, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=AstraMindAI/xtts2-gpt, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=False, use_cached_outputs=False, chat_template_text_format=string, mm_processor_kwargs=None, pooler_config=None)
08:29:39.898 | logger.py:65 | ℹ️ INFO     | Using Flash Attention backend.
08:29:40.230 | logger.py:65 | ℹ️ INFO     | Starting to load model AstraMindAI/xtts2-gpt...
08:29:40.856 | logger.py:65 | ℹ️ INFO     | Using model weights format ['*.safetensors']
08:29:41.284 | logger.py:65 | ℹ️ INFO     | No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  1.20it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  1.19it/s]

08:29:42.488 | logger.py:65 | ℹ️ INFO     | Loading model weights took 0.7099 GB
08:29:43.235 | logger.py:65 | ℹ️ INFO     | Memory profiling results: total_gpu_memory=21.95GiB initial_memory_usage=0.96GiB peak_torch_memory=1.00GiB memory_usage_post_profile=1.00GiB non_torch_memory=0.28GiB kv_cache_size=0.37GiB gpu_memory_utilization=0.08
08:29:43.511 | logger.py:65 | ℹ️ INFO     | # GPU blocks: 203, # CPU blocks: 2184
08:29:43.516 | logger.py:65 | ℹ️ INFO     | Maximum concurrency for 1047 tokens per request: 3.10x


Time taken: 3.106173033000232 seconds to generate first chunk
TTSOutput(array=array([1.169e-03, 6.700e-04, 8.821e-04, ..., 1.864e-04, 2.675e-04,
       9.710e-05], dtype=float16), sample_rate=24000, bit_depth=32, bit_rate=192, compression=10, channel=1, start_time=1734079442.5646453, end_time=None, token_length=130)

Environment

Please run the following commands and include the output:

# OS Information
uname -a

 SMP Debian 5.10.226-1 (2024-10-03) x86_64 GNU/Linux

# Python version
python --version
Python 3.10.16

# Installed Python packages
pip list

aiofiles                          24.1.0
aiohappyeyeballs                  2.4.4
aiohttp                           3.11.10
aiosignal                         1.3.1
annotated-types                   0.7.0
anyio                             4.7.0
asttokens                         3.0.0
async-timeout                     5.0.1
attrs                             24.2.0
audioread                         3.0.1
auralis                           0.2.7.post1
beautifulsoup4                    4.12.3
blis                              0.7.11
cachetools                        5.5.0
catalogue                         2.0.10
certifi                           2024.8.30
cffi                              1.17.1
charset-normalizer                3.4.0
click                             8.1.7
cloudpathlib                      0.20.0
cloudpickle                       3.1.0
colorama                          0.4.6
comm                              0.2.2
compressed-tensors                0.8.0
confection                        0.1.5
cutlet                            0.4.0
cymem                             2.0.10
datasets                          3.2.0
debugpy                           1.8.9
decorator                         5.1.1
dill                              0.3.8
diskcache                         5.6.3
distro                            1.9.0
docopt                            0.6.2
EbookLib                          0.18
einops                            0.8.0
exceptiongroup                    1.2.2
executing                         2.1.0
fastapi                           0.115.6
ffmpeg                            1.4
filelock                          3.16.1
frozenlist                        1.5.0
fsspec                            2024.9.0
fugashi                           1.4.0
future                            1.0.0
gguf                              0.10.0
h11                               0.14.0
hangul-romanize                   0.1.0
httpcore                          1.0.7
httptools                         0.6.4
httpx                             0.28.1
huggingface-hub                   0.26.5
idna                              3.10
importlib_metadata                8.5.0
iniconfig                         2.0.0
interegular                       0.3.3
ipykernel                         6.29.5
ipython                           8.30.0
jaconv                            0.4.0
jedi                              0.19.2
Jinja2                            3.1.4
jiter                             0.8.2
joblib                            1.4.2
jsonschema                        4.23.0
jsonschema-specifications         2024.10.1
jupyter_client                    8.6.3
jupyter_core                      5.7.2
langcodes                         3.5.0
langid                            1.1.6
language_data                     1.3.0
lark                              1.2.2
lazy_loader                       0.4
librosa                           0.10.2.post1
llvmlite                          0.43.0
lm-format-enforcer                0.10.9
lxml                              5.3.0
marisa-trie                       1.2.1
markdown-it-py                    3.0.0
MarkupSafe                        3.0.2
matplotlib-inline                 0.1.7
mdurl                             0.1.2
mistral_common                    1.5.1
mojimoji                          0.0.13
mpmath                            1.3.0
msgpack                           1.1.0
msgspec                           0.18.6
multidict                         6.1.0
multiprocess                      0.70.16
murmurhash                        1.0.11
nest-asyncio                      1.6.0
networkx                          3.4.2
num2words                         0.5.13
numba                             0.60.0
numpy                             1.26.4
nvidia-cublas-cu12                12.4.5.8
nvidia-cuda-cupti-cu12            12.4.127
nvidia-cuda-nvrtc-cu12            12.4.127
nvidia-cuda-runtime-cu12          12.4.127
nvidia-cudnn-cu12                 9.1.0.70
nvidia-cufft-cu12                 11.2.1.3
nvidia-curand-cu12                10.3.5.147
nvidia-cusolver-cu12              11.6.1.9
nvidia-cusparse-cu12              12.3.1.170
nvidia-ml-py                      12.560.30
nvidia-nccl-cu12                  2.21.5
nvidia-nvjitlink-cu12             12.4.127
nvidia-nvtx-cu12                  12.4.127
openai                            1.57.3
OpenCC                            1.1.9
opencv-python-headless            4.10.0.84
outlines                          0.0.46
packaging                         24.2
pandas                            2.2.3
parso                             0.8.4
partial-json-parser               0.2.1.1.post4
pexpect                           4.9.0
pillow                            10.4.0
pip                               24.3.1
platformdirs                      4.3.6
pluggy                            1.5.0
pooch                             1.8.2
preshed                           3.0.9
prometheus_client                 0.21.1
prometheus-fastapi-instrumentator 7.0.0
prompt_toolkit                    3.0.48
propcache                         0.2.1
protobuf                          5.29.1
psutil                            6.1.0
ptyprocess                        0.7.0
pure_eval                         0.2.3
py-cpuinfo                        9.0.0
pyairports                        2.1.1
pyarrow                           18.1.0
pycountry                         24.6.1
pycparser                         2.22
pydantic                          2.10.3
pydantic_core                     2.27.1
Pygments                          2.18.0
pyloudnorm                        0.1.1
pypinyin                          0.53.0
pytest                            8.3.4
python-dateutil                   2.9.0.post0
python-dotenv                     1.0.1
pytz                              2024.2
PyYAML                            6.0.2
pyzmq                             26.2.0
ray                               2.40.0
referencing                       0.35.1
regex                             2024.11.6
requests                          2.32.3
rich                              13.9.4
rpds-py                           0.22.3
safetensors                       0.4.5
scikit-learn                      1.6.0
scipy                             1.14.1
sentencepiece                     0.2.0
setuptools                        75.6.0
shellingham                       1.5.4
six                               1.17.0
smart-open                        7.0.5
sniffio                           1.3.1
sounddevice                       0.5.1
soundfile                         0.12.1
soupsieve                         2.6
soxr                              0.5.0.post1
spacy                             3.7.5
spacy-legacy                      3.0.12
spacy-loggers                     1.0.5
srsly                             2.5.0
stack-data                        0.6.3
starlette                         0.41.3
sympy                             1.13.1
thinc                             8.2.5
threadpoolctl                     3.5.0
tiktoken                          0.7.0
tokenizers                        0.21.0
tomli                             2.2.1
torch                             2.5.1
torchaudio                        2.5.1
torchvision                       0.20.1
tornado                           6.4.2
tqdm                              4.67.1
traitlets                         5.14.3
transformers                      4.47.0
triton                            3.1.0
typer                             0.15.1
typing_extensions                 4.12.2
tzdata                            2024.2
urllib3                           2.2.3
uvicorn                           0.32.1
uvloop                            0.21.0
vllm                              0.6.4.post1
wasabi                            1.1.3
watchfiles                        1.0.3
wcwidth                           0.2.13
weasel                            0.4.1
websockets                        14.1
wheel                             0.45.1
wrapt                             1.17.0
xformers                          0.0.28.post3
xxhash                            3.5.0
yarl                              1.18.3

# GPU Information (if applicable)
nvidia-smi

Fri Dec 13 08:57:41 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L4                      On  |   00000000:00:03.0 Off |                    0 |
| N/A   63C    P0             33W /   72W |    3103MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A    118101      C   /opt/conda/envs/au/bin/python                3080MiB |
+-----------------------------------------------------------------------------------------+

# CUDA version (if applicable)
nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0
@tareqalmuntasir7 tareqalmuntasir7 added the bug Something isn't working label Dec 13, 2024
@mlinmg
Contributor

mlinmg commented Dec 13, 2024

The stream works sentence by sentence: if you submit longer inputs, it yields output one sentence at a time. If you used tokens directly from the GPT generator, the output would have much lower quality.
From what I imagine, you would take the spectrogram and stream it by splitting along the time dimension, but I've never seen code that does that.
Do you have an example implementation of time-dimension spectrogram batching we could take inspiration from (or of how you are doing streaming for the GAN part)?
I'm having difficulty figuring out how to split the spectrogram so that it yields a result identical to passing the spectrogram as a single unit, since it has some components that are…

Edit: Profiling the vocalization path on your quote shows that the request reaches the synthesis stage after 3.2 s and is vocalized in about 100 ms, so the vLLM side needs further optimization to reduce TTFT before streaming can actually be faster.
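
For concreteness, a minimal sketch of the time-dimension chunking idea, assuming a hypothetical vocoder callable that maps a mel spectrogram [n_mels, T] to a waveform (hop_length, chunk/overlap sizes, and the cross-fade are all assumptions, not Auralis's actual implementation). The overlap and cross-fade are one common way to hide boundary artifacts, but because of the vocoder's receptive field the result is still not bit-identical to vocoding the full spectrogram at once:

import numpy as np

def stream_vocoder_chunks(mel, vocoder, chunk_frames=64, overlap_frames=8,
                          hop_length=256):
    """Yield waveform chunks from a mel spectrogram mel[n_mels, T].

    Hypothetical sketch: consecutive windows overlap by `overlap_frames`
    so the vocoder sees context across boundaries; the overlapping
    samples are cross-faded to reduce audible seams.
    """
    n_frames = mel.shape[1]
    samples_overlap = overlap_frames * hop_length
    fade = np.linspace(0.0, 1.0, samples_overlap, dtype=np.float32)
    tail = None  # overlapping samples carried over from the previous window

    start = 0
    while start < n_frames:
        end = min(start + chunk_frames + overlap_frames, n_frames)
        wav = vocoder(mel[:, start:end])  # 1-D waveform for this window
        if tail is not None:
            # Cross-fade the previous tail into the head of this window.
            wav[:samples_overlap] = (1 - fade) * tail + fade * wav[:samples_overlap]
        if end < n_frames:
            tail = wav[-samples_overlap:].copy()
            yield wav[:-samples_overlap]
            start += chunk_frames
        else:
            yield wav  # final window: flush everything
            break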

@mlinmg mlinmg added enhancement New feature or request and removed bug Something isn't working labels Dec 13, 2024
@mlinmg
Contributor

mlinmg commented Dec 13, 2024

To reduce TTFB further, you could use prepare_for_streaming_generation.
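
A hypothetical usage sketch of that suggestion. The method name comes from the comment above, but its exact signature and call pattern are assumptions; check the Auralis source for the real API:

import asyncio
from auralis import TTS, TTSRequest

tts = TTS().from_pretrained("AstraMindAI/xttsv2", gpt_model='AstraMindAI/xtts2-gpt')

async def main():
    request = TTSRequest(
        text="Warm-up sentence.",
        speaker_files=["/home/common/Auralis/tests/resources/audio_samples/female.wav"],
        stream=True,
    )
    # Assumed call: pre-compute the speaker embeddings and GPT conditioning
    # once, so subsequent requests for the same speaker skip that work.
    await tts.prepare_for_streaming_generation(request)

    # Later requests reuse the cached conditioning, lowering TTFB.
    stream = await tts.generate_speech_async(request=request)
    async for chunk in stream:
        ...  # play or buffer the audio chunk

asyncio.run(main())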

@jadecannondev

@mlinmg how do you recommend using prepare_for_streaming_generation? Is there also a way to change the chunk size? TTFB is currently around 3 s for me, and even then streaming delivers the whole two sentences at once rather than chunk by chunk. Thank you :)

@mlinmg
Contributor

mlinmg commented Jan 7, 2025

Yes. Calling that method pre-computes the speaker embeddings and GPT conditioning, so they can simply be reused for the next chunk. We avoided doing true streaming because we observed that the speech quality was very poor, but you could easily modify the classes to implement it: first switch the vLLM output to delta type, then yield the result every x tokens to obtain real streaming. However, as I said, the output quality will probably be much worse than the regular model's.
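
A hypothetical sketch of that modification. Here engine, prompt, sampling_params, request_id, and decode_tokens_to_audio stand in for Auralis internals and are assumptions; the only grounded parts are vLLM's async generate() loop and the delta-output behavior described above (each step carrying only the newly generated token IDs):

from typing import AsyncIterator

CHUNK_TOKENS = 20  # how many GPT tokens to accumulate per audio chunk

async def stream_every_n_tokens(engine, prompt, sampling_params,
                                request_id, decode_tokens_to_audio) -> AsyncIterator:
    """With vLLM outputs switched to delta type, vocalize every
    CHUNK_TOKENS tokens instead of waiting for the full sentence.
    As noted above, chunking this finely will likely degrade quality."""
    buffer = []
    async for step in engine.generate(prompt, sampling_params, request_id):
        buffer.extend(step.outputs[0].token_ids)  # delta: new tokens only
        while len(buffer) >= CHUNK_TOKENS:
            yield decode_tokens_to_audio(buffer[:CHUNK_TOKENS])
            buffer = buffer[CHUNK_TOKENS:]
    if buffer:  # flush whatever remains at the end of generation
        yield decode_tokens_to_audio(buffer)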
