Paralinguistics-Aware Speech-Empowered Large Language Models for Natural Conversation

Authors    Heeseung Kim¹, Soonshin Seo², Kyeongseok Jeong², Ohsung Kwon², Soyoon Kim², Jungwhan Kim², Jaehong Lee², Eunwoo Song², Myungwoo Oh², Jung-Woo Ha², Sungroh Yoon¹,†, Kang Min Yoo²,†
         ¹Seoul National University, ²NAVER Cloud, †Corresponding Authors


Abstract

Recent work shows promising results in expanding the capabilities of large language models (LLM) to directly understand and synthesize speech. However, an LLM-based strategy for modeling spoken dialogs remains elusive, calling for further investigation. This paper introduces an extensive speech-text LLM framework, the Unified Spoken Dialog Model (USDM), designed to generate coherent spoken responses with naturally occurring prosodic features relevant to the given input speech without relying on explicit automatic speech recognition (ASR) or text-to-speech (TTS) systems. We have verified the inclusion of prosody in speech tokens that predominantly contain semantic information and have used this foundation to construct a prosody-infused speech-text model. Additionally, we propose a generalized speech-text pretraining scheme that enhances the capture of cross-modal semantics. To construct USDM, we fine-tune our speech-text model on spoken dialog data using a multi-step spoken dialog template that stimulates the chain-of-reasoning capabilities exhibited by the underlying LLM. Automatic and human evaluations on the DailyTalk dataset demonstrate that our approach effectively generates natural-sounding spoken responses, surpassing previous and cascaded baselines. Our code and checkpoints are available at https://github.com/naver-ai/usdm.


Overview

Our spoken dialog framework is structured in three parts:

  1. Speech Tokenizer

    • We use speech tokens (vocab size = 10,000) from SeamlessM4T, capturing essential semantic and paralinguistic information while minimizing vocabulary size for better generalization. For details and audio samples, see our project page and paper.
  2. Spoken Dialog Model (USDM)

    • Our model converts user speech tokens into response tokens using a two-stage training approach:
      1. Cross-modal training on a pre-trained LLM (Mistral-7B-v0.1) to handle text and speech.
      2. Fine-tuning for spoken dialog on DailyTalk.
    • The DailyTalk-trained model supports two speakers; for broader scenarios, consider datasets like Fisher and Switchboard. For new tasks, load the pre-trained checkpoint and fine-tune as needed.
  3. Speech Decoder

    • The decoder converts model output into speech using a Voicebox-based mel-spectrogram generator and BigVGAN for speech synthesis. We trained the Voicebox model, while BigVGAN uses its official checkpoint.
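
The three parts above compose into a single audio-in, audio-out pipeline. The following is a minimal sketch of that data flow only: every name in it (`SpokenDialogPipeline`, `toy_tokenize`, `toy_respond`, `toy_decode`) is a hypothetical placeholder rather than the repository's API, and the toy stages stand in for the real tokenizer, USDM, and decoder so the example runs without model weights.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical type aliases for illustration; not the repository's API.
SpeechTokens = List[int]  # discrete speech-token IDs
Waveform = List[float]    # mono audio samples

VOCAB_SIZE = 10_000  # speech-token vocabulary size stated in the overview


@dataclass
class SpokenDialogPipeline:
    tokenize: Callable[[Waveform], SpeechTokens]     # part 1: speech tokenizer
    respond: Callable[[SpeechTokens], SpeechTokens]  # part 2: spoken dialog model
    decode: Callable[[SpeechTokens], Waveform]       # part 3: speech decoder

    def __call__(self, user_audio: Waveform) -> Waveform:
        user_tokens = self.tokenize(user_audio)
        response_tokens = self.respond(user_tokens)
        # Guard against IDs that fall outside the speech-token vocabulary.
        if any(not 0 <= t < VOCAB_SIZE for t in response_tokens):
            raise ValueError("speech token outside the 10,000-entry vocabulary")
        return self.decode(response_tokens)


# Toy stand-ins so the sketch runs without any model weights.
def toy_tokenize(audio: Waveform) -> SpeechTokens:
    return [round(abs(x) * 100) % VOCAB_SIZE for x in audio]


def toy_respond(tokens: SpeechTokens) -> SpeechTokens:
    return list(reversed(tokens))


def toy_decode(tokens: SpeechTokens) -> Waveform:
    return [t / 100 for t in tokens]


pipeline = SpokenDialogPipeline(toy_tokenize, toy_respond, toy_decode)
reply = pipeline([0.1, 0.2, 0.3])
```

Keeping each stage behind a plain callable mirrors the modular structure described above: the decoder can be swapped or personalized without touching the dialog model.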

We provide pre-trained checkpoints, a Streamlit demo, and inference scripts, each described in the sections below.

For additional information, please refer to our paper.


Checkpoints in 🤗 Hugging Face Hub

We provide pre-trained models through Hugging Face Hub. The speech tokenizer and vocoder are automatically downloaded via SeamlessM4T and Hugging Face Hub.

| Repo Name | Usage |
| --- | --- |
| naver-ai/xlsr-token-Voicebox | Token-Voicebox: Converts speech tokens to mel-spectrograms, supporting personalization via reference mel-spectrograms. |
| naver-ai/USTM | Unified Speech-Text Model (USTM): Mistral-7B-v0.1 further trained on speech-text data, intended as a starting point for fine-tuning on new tasks. |
| naver-ai/USDM-DailyTalk | Unified Spoken Dialog Model (USDM-DailyTalk): USTM fine-tuned on DailyTalk with a single-turn spoken dialog setup. |
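
The inference scripts download these checkpoints on their own, but it can be convenient to pre-fetch one, for example before fine-tuning USTM on a new task. The sketch below uses `snapshot_download` from the `huggingface_hub` package with the repository IDs listed above. The helper name `fetch_checkpoint` and the short keys in `CHECKPOINT_REPOS` are made up for illustration, and whether the repository's scripts reuse a directory populated this way is an assumption that has not been verified here.

```python
from typing import Dict, Optional

# Repository IDs copied from the checkpoint list above; the short keys are
# hypothetical labels chosen for this example.
CHECKPOINT_REPOS: Dict[str, str] = {
    "token_voicebox": "naver-ai/xlsr-token-Voicebox",
    "ustm": "naver-ai/USTM",
    "usdm_dailytalk": "naver-ai/USDM-DailyTalk",
}


def fetch_checkpoint(name: str, cache_dir: Optional[str] = None) -> str:
    """Download one checkpoint repository and return its local directory."""
    if name not in CHECKPOINT_REPOS:
        raise KeyError(
            f"unknown checkpoint {name!r}; choose from {sorted(CHECKPOINT_REPOS)}"
        )
    # Imported lazily so the mapping can be inspected without the dependency.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=CHECKPOINT_REPOS[name], cache_dir=cache_dir)
```

For example, `fetch_checkpoint("ustm", cache_dir="YOUR_MODEL_CACHE_DIR")` would download the pre-trained speech-text model and return the path of the downloaded snapshot.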

Demo and Inference

We provide a Streamlit demo. We also provide inference scripts based on transformers and vLLM libraries. Follow these steps to set up the environment (tested on CUDA V12.4.131, Python 3.10.15, Conda 24.5.0):

# Step 1: Create and activate a new conda environment
conda create -n usdm python=3.10.15
conda activate usdm

# Step 2: Install common dependencies
conda install -c conda-forge libsndfile=1.0.31
pip install torch==2.2.1 torchvision==0.17.1 torchaudio==2.2.1 --index-url https://download.pytorch.org/whl/cu121
pip install fairseq2 --extra-index-url https://fair.pkg.atmeta.com/fairseq2/whl/pt2.2.1/cu121
pip install .
pip install flash-attn==2.6.3 --no-build-isolation

The default generation strategy is greedy search. To explore different output samples, adjust the sampling parameters (top_k, top_p, temperature, num_beams, ...).
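
As a rough illustration of how these parameters relate to greedy search, the sketch below builds keyword arguments for a transformers `model.generate()` call. `do_sample`, `top_k`, `top_p`, `temperature`, and `num_beams` are standard `generate()` arguments; the helper `generation_kwargs` is hypothetical, and how the repository's scripts expose these settings is not shown here, so treat the wiring as an assumption.

```python
from typing import Any, Dict, Optional


def generation_kwargs(
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
    temperature: Optional[float] = None,
    num_beams: int = 1,
) -> Dict[str, Any]:
    """Build keyword arguments for a transformers `model.generate()` call.

    With no arguments this describes greedy search (no sampling, one beam).
    Setting any of top_k, top_p, or temperature switches sampling on.
    """
    if top_k is not None and top_k < 1:
        raise ValueError("top_k must be a positive integer")
    if top_p is not None and not 0.0 < top_p <= 1.0:
        raise ValueError("top_p must be in (0, 1]")
    if temperature is not None and temperature <= 0.0:
        raise ValueError("temperature must be positive")
    if num_beams < 1:
        raise ValueError("num_beams must be at least 1")

    sampling = any(v is not None for v in (top_k, top_p, temperature))
    kwargs: Dict[str, Any] = {"do_sample": sampling, "num_beams": num_beams}
    if top_k is not None:
        kwargs["top_k"] = top_k
    if top_p is not None:
        kwargs["top_p"] = top_p
    if temperature is not None:
        kwargs["temperature"] = temperature
    return kwargs


greedy = generation_kwargs()
nucleus = generation_kwargs(top_p=0.9, temperature=0.7)
```

The resulting dictionary would be passed as `model.generate(**inputs, **nucleus)`; sampling makes repeated runs on the same input produce different responses, whereas greedy search is deterministic.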

Note

  • The released checkpoint of USDM is trained on DailyTalk using a single-turn template. For testing, we recommend using samples from the speakers in DailyTalk as inputs. If you wish to work with different speakers for input and output, we suggest fine-tuning our provided pre-trained speech-text model on your specific dataset.
  • When you run the code, the checkpoints (excluding the unit extractor) are downloaded automatically to YOUR_MODEL_CACHE_DIR. Set YOUR_MODEL_CACHE_DIR to the directory where you want the checkpoints stored.

Running Streamlit Demo

To start Streamlit, use the following command:

MODEL_CACHE_DIR=YOUR_MODEL_CACHE_DIR CUDA_VISIBLE_DEVICES=gpu_id streamlit run src/streamlit_demo.py --server.port YOUR_PORT

Access the demo at localhost:YOUR_PORT.

Note: USDM is trained on DailyTalk and may perform worse on voices outside this dataset.

Standard Inference

# transformers-based (model.generate())
CUDA_VISIBLE_DEVICES=gpu_id python src/inference.py --input_path INPUT_WAV_PATH --output_path PATH_TO_SAVE_SPOKEN_RESPONSE --model_cache_dir YOUR_MODEL_CACHE_DIR

# vLLM-based
CUDA_VISIBLE_DEVICES=gpu_id python src/inference_vllm.py --input_path INPUT_WAV_PATH --output_path PATH_TO_SAVE_SPOKEN_RESPONSE --model_cache_dir YOUR_MODEL_CACHE_DIR

Speaker-Adaptive Inference (Using Reference Audio)

To generate a response that adapts to a reference speaker's voice, add the --reference_path argument:

# transformers-based (model.generate())
CUDA_VISIBLE_DEVICES=gpu_id python src/inference.py --input_path INPUT_WAV_PATH --reference_path REFERENCE_WAV_PATH --output_path PATH_TO_SAVE_SPOKEN_RESPONSE --model_cache_dir YOUR_MODEL_CACHE_DIR

# vLLM-based
CUDA_VISIBLE_DEVICES=gpu_id python src/inference_vllm.py --input_path INPUT_WAV_PATH --reference_path REFERENCE_WAV_PATH --output_path PATH_TO_SAVE_SPOKEN_RESPONSE --model_cache_dir YOUR_MODEL_CACHE_DIR
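
When running many inputs, it may help to assemble these commands programmatically. The sketch below builds the same command lines shown above, with the optional `--reference_path` for speaker-adaptive inference, and runs them through `subprocess`. The helper names `build_inference_command` and `run_inference` are hypothetical; only the script paths and flags come from the commands above.

```python
import os
import subprocess
from typing import List, Optional


def build_inference_command(
    input_path: str,
    output_path: str,
    model_cache_dir: str,
    reference_path: Optional[str] = None,
    use_vllm: bool = False,
) -> List[str]:
    """Assemble the argument list for one of the two inference scripts."""
    script = "src/inference_vllm.py" if use_vllm else "src/inference.py"
    cmd = ["python", script, "--input_path", input_path]
    if reference_path is not None:
        # Speaker-adaptive inference: pass reference audio of the target voice.
        cmd += ["--reference_path", reference_path]
    cmd += ["--output_path", output_path, "--model_cache_dir", model_cache_dir]
    return cmd


def run_inference(cmd: List[str], gpu_id: str = "0") -> None:
    """Run the command from the repository root on the selected GPU."""
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpu_id)
    subprocess.run(cmd, env=env, check=True)


standard = build_inference_command("in.wav", "out.wav", "cache")
adaptive = build_inference_command(
    "in.wav", "out.wav", "cache", reference_path="ref.wav", use_vllm=True
)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces, and `check=True` raises an error if a run fails instead of silently continuing to the next file.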

Note: To generate spoken responses in a desired speaker's voice, two things are required: (1) reference audio from the target speaker, and (2) speech tokens for the target speaker, generated by USDM. The current single-turn USDM, fine-tuned on DailyTalk, is optimized for two specific speakers and generates speech tokens for those speakers, which may limit its adaptability to other voices.

Potential Solution: To address this, USDM must be trained to robustly generate speech tokens for unseen speakers. This can be achieved by training USDM on a multi-turn, multi-speaker dataset (e.g., Fisher), enabling it to generate speech tokens for unseen speakers by learning to produce consistent speech tokens for the same speaker across multiple turns. For more detailed explanations and examples, please refer to the multi-turn Fisher examples on our project page and Section A.1 of our paper.


BibTeX

@inproceedings{
   kim2024paralinguisticsaware,
   title={Paralinguistics-Aware Speech-Empowered Large Language Models for Natural Conversation},
   author={Heeseung Kim and Soonshin Seo and Kyeongseok Jeong and Ohsung Kwon and Soyoon Kim and Jungwhan Kim and Jaehong Lee and Eunwoo Song and Myungwoo Oh and Jung-Woo Ha and Sungroh Yoon and Kang Min Yoo},
   booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
   year={2024},
   url={https://openreview.net/forum?id=NjewXJUDYq}
}

License

USDM
Copyright (c) 2024-present NAVER Cloud Corp.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
    http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
