
Vtuber Voice Generator Fine-Tuning with GPT-SoVITS Pipeline

This project fine-tunes a voice generator model based on the GPT-SoVITS pipeline using Vtuber-related data. Below are the detailed steps of the process:


0. Install dependencies:

  pip install -r requirements.txt

1. Training Data Generation

  • Audio Downloading:
    • Vtuber audio files are downloaded using yt-dlp, along with subtitles and timestamps:

      yt-dlp --write-subs --all-subs -f bestaudio --extract-audio --audio-format wav --sub-format srt -o "%(title)s.%(ext)s" --cookies-from-browser chrome url
    • For videos without subtitles, Whisper generates subtitles in SRT format:

      whisper sample.wav --model small --output_format srt --language Chinese
    • (Optional) To minimize timestamp errors, a primitive split is first performed to reduce large audio files by running the split_audio script in tools/audio_text_folder (a sketch of such a splitter follows this step):

      python split_audio.py sample.wav output_folder
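
For reference, a primitive splitter can be as simple as cutting the file into fixed-length chunks. The sketch below does this with pydub (an assumed dependency); it is a stand-in for, not a copy of, the actual split_audio.py, and the 10-minute chunk length is an arbitrary choice:

    # sketch_split_audio.py -- fixed-length splitting with pydub (assumed
    # dependency); the real tools/audio_text_folder/split_audio.py may differ.
    import sys
    from pathlib import Path

    from pydub import AudioSegment

    CHUNK_MS = 10 * 60 * 1000  # arbitrary 10-minute chunks

    def split(wav_path: str, out_dir: str) -> None:
        audio = AudioSegment.from_wav(wav_path)
        out = Path(out_dir)
        out.mkdir(parents=True, exist_ok=True)
        for i, start in enumerate(range(0, len(audio), CHUNK_MS)):
            chunk = audio[start:start + CHUNK_MS]  # slicing is in milliseconds
            chunk.export(str(out / f"{Path(wav_path).stem}_{i:03d}.wav"), format="wav")

    if __name__ == "__main__":
        split(sys.argv[1], sys.argv[2])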

2. Dataset Preparation

  • Slicing Audio:
    A custom slicer cuts the audio files at the subtitle timestamps; note that some slices may still contain long silences at the beginning or end. Set the input and output folders inside the script before running:

      python slice.py
  • Denoising:
    Sliced audio files undergo a denoising process to enhance quality:

      python denoise.py -i input_folder -o output_folder
  • Transcription:
    A text file containing the corresponding transcription is generated alongside each audio slice; this happens during the slicing step above (see the sketch below).
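
To make the slicing and transcription steps concrete, here is a minimal sketch of timestamp-based slicing that also writes the matching transcription for each slice. It assumes the srt and pydub packages; the repo's slice.py is the authoritative version, and the folder names below are placeholders:

    # sketch_slice_by_srt.py -- cut a wav at SRT subtitle timestamps and write
    # one .txt transcription per slice; file and folder names are placeholders.
    from pathlib import Path

    import srt
    from pydub import AudioSegment

    IN_WAV, IN_SRT = "sample.wav", "sample.srt"  # placeholder inputs
    OUT_DIR = Path("sliced")                     # placeholder output folder

    audio = AudioSegment.from_wav(IN_WAV)
    OUT_DIR.mkdir(exist_ok=True)
    for i, sub in enumerate(srt.parse(Path(IN_SRT).read_text(encoding="utf-8"))):
        start_ms = int(sub.start.total_seconds() * 1000)
        end_ms = int(sub.end.total_seconds() * 1000)
        audio[start_ms:end_ms].export(str(OUT_DIR / f"clip_{i:04d}.wav"), format="wav")
        # the transcription text file that the feature steps consume
        (OUT_DIR / f"clip_{i:04d}.txt").write_text(sub.content, encoding="utf-8")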


3. Feature Generation

3.1 Text Features

  • Transcribed text is fed into a pretrained BERT model (Chinese-Roberta-WWM-Ext-Large) to generate BERT features. The model needs to be downloaded into the pretrained_model folder.
  • The text is also converted to phonemes using pypinyin-g2pW. The G2PWModel needs to be downloaded into the tools/G2PWModel folder.
  • The extraction step can be done by running:
     python 1-dp-get-text.py
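
For illustration only, the sketch below shows what the text-feature step boils down to, assuming the transformers and pypinyin packages and the model downloaded locally; 1-dp-get-text.py is the authoritative implementation, and the real pipeline uses pypinyin-g2pW (not plain pypinyin) for polyphone disambiguation:

    # sketch_text_features.py -- per-token BERT features plus a naive phoneme
    # conversion; the model path is a placeholder.
    import torch
    from pypinyin import Style, lazy_pinyin
    from transformers import AutoModel, AutoTokenizer

    MODEL_DIR = "pretrained_model/chinese-roberta-wwm-ext-large"  # placeholder

    tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
    bert = AutoModel.from_pretrained(MODEL_DIR)

    text = "你好，世界"
    with torch.no_grad():
        inputs = tokenizer(text, return_tensors="pt")
        hidden = bert(**inputs).last_hidden_state  # (1, seq_len, 1024) features

    phonemes = lazy_pinyin(text, style=Style.TONE3)  # pinyin with tone numbers
    print(hidden.shape, phonemes)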

3.2 Audio Features

  • Input audio is converted to CN-HuBERT features using a pretrained CN-HuBERT model (chinese-hubert-base, available at https://huggingface.co/lj1995/GPT-SoVITS/tree/main).
  • CN-HuBERT features are passed to a pretrained SynthesizerTrn model (S2G488K.pth) to generate semantic tokens.
  • The extraction is done by running, in order:
      python 2-dp-get-hubert-wav32k.py
      python 3-dp-get-semantic.py
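
A minimal sketch of the CN-HuBERT feature extraction follows, assuming the transformers and torchaudio packages and a local copy of chinese-hubert-base; the two scripts above additionally handle the 32 kHz waveform preparation and the semantic-token quantization with S2G488K.pth:

    # sketch_audio_features.py -- CN-HuBERT frame features for one slice; the
    # model path and input file are placeholders.
    import torch
    import torchaudio
    from transformers import HubertModel, Wav2Vec2FeatureExtractor

    HUBERT_DIR = "pretrained_model/chinese-hubert-base"  # placeholder

    extractor = Wav2Vec2FeatureExtractor.from_pretrained(HUBERT_DIR)
    hubert = HubertModel.from_pretrained(HUBERT_DIR)

    wav, sr = torchaudio.load("sliced/clip_0000.wav")
    wav = torchaudio.functional.resample(wav, sr, 16000)  # HuBERT expects 16 kHz

    with torch.no_grad():
        inputs = extractor(wav.squeeze(0).numpy(), sampling_rate=16000,
                           return_tensors="pt")
        feats = hubert(inputs.input_values).last_hidden_state  # (1, frames, 768)
    print(feats.shape)  # these features are what S2G488K.pth turns into tokens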

4. Fine-Tuning

4.1 SoVITS Training

  • The SoVITS model is trained to reconstruct waveform (WAV) audio from semantic tokens.
  • Training is launched with:
       python s2_train.py --config "./configs/tmp_s2.json" --exp_dir "./logs/v1_trial"
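
Before launching, the config usually has to be pointed at your own experiment data. The snippet below only sketches how to inspect and patch tmp_s2.json; the key names used ("train", "epochs") are assumptions, so check your actual config for the real ones:

    # sketch_patch_s2_config.py -- inspect/patch the SoVITS config; the key
    # names here are assumptions, not guaranteed to match tmp_s2.json.
    import json

    with open("configs/tmp_s2.json", encoding="utf-8") as f:
        cfg = json.load(f)
    print(sorted(cfg))  # list the real top-level keys first

    cfg.setdefault("train", {})["epochs"] = 8  # hypothetical key and value

    with open("configs/tmp_s2.json", "w", encoding="utf-8") as f:
        json.dump(cfg, f, ensure_ascii=False, indent=2)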

4.2 GPT Training

  • The GPT component is trained to predict the next semantic token using:
    • Current semantic token
    • Phonemes
    • BERT features
  • Training is launched with:
      python s1_train.py --config "configs/tmp_s1.yaml"
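
As a toy illustration of this objective (not the actual s1_train.py code): the loss is ordinary next-token cross-entropy over the semantic token sequence, with the logits conditioned on phonemes and BERT features. All shapes and the vocabulary size below are made up:

    # sketch_gpt_objective.py -- next-semantic-token cross-entropy on random
    # stand-in tensors; purely illustrative.
    import torch
    import torch.nn.functional as F

    B, T, V = 2, 10, 1024                  # batch, sequence length, vocab (made up)
    semantic = torch.randint(0, V, (B, T))
    logits = torch.randn(B, T, V)          # stand-in for model output, which in the
                                           # real model conditions on phonemes + BERT

    # shift by one: predict token t+1 from everything up to t
    loss = F.cross_entropy(logits[:, :-1].reshape(-1, V),
                           semantic[:, 1:].reshape(-1))
    print(loss.item())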

5. Inference

  • The inference_cli script:

    • Loads fine-tuned weights from GPT-weights folder and SoVITS-weights folder.
    • Uses a reference audio and a target text to generate the target audio. The script includes a predefined reference audio; replace it with the path to your own reference audio.
        python inference_cli.py --gpt_model ./GPT_weights/gpt_model.ckpt --sovits_model ./SoVITS_weights/sovits_model.pth --target_text ./test.txt  --output_path ./output_folder
  • Observation:
    Removing the reference audio significantly reduces the quality of the generated target audio.
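
To synthesize several texts in a row, the CLI can be wrapped in a small driver. The flags below are copied from the command above; the folder layout is a placeholder:

    # sketch_batch_infer.py -- run inference_cli.py once per target text file;
    # paths are placeholders.
    import subprocess
    from pathlib import Path

    for txt in sorted(Path("targets").glob("*.txt")):
        subprocess.run([
            "python", "inference_cli.py",
            "--gpt_model", "./GPT_weights/gpt_model.ckpt",
            "--sovits_model", "./SoVITS_weights/sovits_model.pth",
            "--target_text", str(txt),
            "--output_path", f"./output_folder/{txt.stem}",
        ], check=True)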


6. Evaluation


Related Projects

For further details, visit the official GPT-SoVITS GitHub repository: https://github.com/RVC-Boss/GPT-SoVITS
