Variational Inference with adversarial learning for end-to-end Singing Voice Conversion based on VITS
The bigvgan-mix-v2 branch has good audio quality
The RoFormer-HiFTNet branch has fast inference speed
No More Upgrade
- This project targets deep learning beginners; basic knowledge of Python and PyTorch is a prerequisite.
- This project aims to help deep learning beginners move beyond dry, purely theoretical study and master the basics of deep learning through practice.
- This project does not support real-time voice conversion (whisper would need to be replaced if real-time conversion is what you are looking for).
- This project will not develop one-click packages for other purposes.
- A minimum VRAM requirement of 6GB for training
- Support for multiple speakers
- Create unique speakers through speaker mixing
- It can even convert voices with light accompaniment
- You can edit F0 using Excel
AI_Elysia_LoveStory.mp4
Powered by @ShadowVap
Feature | From | Status | Function |
---|---|---|---|
whisper | OpenAI | ✅ | strong noise immunity |
bigvgan | NVIDIA | ✅ | anti-aliasing and snake activation |
natural speech | Microsoft | ✅ | reduce mispronunciation |
neural source-filter | Xin Wang | ✅ | solve the problem of audio F0 discontinuity |
pitch quantization | Xin Wang | ✅ | quantize the F0 for embedding |
speaker encoder | | ✅ | Timbre Encoding and Clustering |
GRL for speaker | Ubisoft | ✅ | Preventing Encoder Leakage Timbre |
SNAC | Samsung | ✅ | One Shot Clone of VITS |
SCLN | Microsoft | ✅ | Improve Clone |
Diffusion | HuaWei | ✅ | Improve sound quality |
PPG perturbation | this project | ✅ | Improved noise immunity and de-timbre |
HuBERT perturbation | this project | ✅ | Improved noise immunity and de-timbre |
VAE perturbation | this project | ✅ | Improve sound quality |
MIX encoder | this project | ✅ | Improve conversion stability |
USP infer | this project | ✅ | Improve conversion stability |
HiFTNet | Columbia University | ✅ | NSF-iSTFTNet for speed up |
RoFormer | Zhuiyi Technology | ✅ | Rotary Positional Embeddings |
Because data perturbation is used, training takes longer than in other projects.
USP: Unvoice and Silence with Pitch during inference
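The "pitch quantization" row in the table above says that F0 is quantized before it is embedded. The sketch below shows one common way to do this, by mapping F0 onto a mel-like scale and rounding to an integer bin. The bin count, the frequency range, and the function names are assumptions chosen for illustration; the project's own quantizer may differ.

```python
import math

# All three constants are assumptions for illustration, not values from this project.
F0_BIN = 256      # number of quantization bins
F0_MIN = 50.0     # lowest F0 in Hz that gets its own bin
F0_MAX = 1100.0   # highest F0 in Hz that gets its own bin


def hz_to_mel(f0_hz: float) -> float:
    """Convert a frequency in Hz to the mel scale."""
    return 1127.0 * math.log(1.0 + f0_hz / 700.0)


MEL_MIN = hz_to_mel(F0_MIN)
MEL_MAX = hz_to_mel(F0_MAX)


def quantize_f0(f0_hz: float) -> int:
    """Map an F0 value in Hz to an integer bin in [1, F0_BIN - 1]."""
    if f0_hz <= 0.0:
        return 1  # unvoiced frames (F0 = 0) share the lowest bin
    scaled = (hz_to_mel(f0_hz) - MEL_MIN) * (F0_BIN - 2) / (MEL_MAX - MEL_MIN) + 1.0
    clipped = min(max(scaled, 1.0), float(F0_BIN - 1))
    return int(clipped + 0.5)  # round to the nearest bin
```

The resulting integer can then index an embedding table, which is why the continuous F0 curve is discretized first.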
- Install PyTorch.
- Install project dependencies:
  pip install -i https://pypi.tuna.tsinghua.edu.cn/simple -r requirements.txt
  Note: whisper is already built in; do not install it again, otherwise it will cause conflicts and errors.
- Download the timbre encoder Speaker-Encoder by @mueller91 and put `best_model.pth.tar` into `speaker_pretrain/`.
- Download the whisper model whisper-large-v2. Make sure to download `large-v2.pt` and put it into `whisper_pretrain/`.
- Download the hubert_soft model and put `hubert-soft-0d54a1f4.pt` into `hubert_pretrain/`.
- Download the pitch extractor crepe full and put `full.pth` into `crepe/assets`.
  Note: crepe `full.pth` is 84.9 MB, not 6 KB.
- Download the pretrained model sovits5.0.pretrain.pth and put it into `vits_pretrain/`.
  python svc_inference.py --config configs/base.yaml --model ./vits_pretrain/sovits5.0.pretrain.pth --spk ./configs/singers/singer0001.npy --wave test.wav
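Before running the test inference, it can help to confirm that every downloaded file is where the steps above expect it. The following is a minimal sketch that only uses the paths listed above; the function name and the 1 MB size threshold (meant to catch truncated downloads such as a 6 KB `full.pth`) are assumptions for illustration.

```python
from pathlib import Path

# Relative paths taken from the download steps above.
REQUIRED_FILES = [
    "speaker_pretrain/best_model.pth.tar",
    "whisper_pretrain/large-v2.pt",
    "hubert_pretrain/hubert-soft-0d54a1f4.pt",
    "crepe/assets/full.pth",
    "vits_pretrain/sovits5.0.pretrain.pth",
]

# Assumed threshold: a model file smaller than this is treated as a broken download.
MIN_BYTES = 1_000_000


def check_pretrained(root="."):
    """Return a list of problems; an empty list means all files look fine."""
    problems = []
    for rel in REQUIRED_FILES:
        path = Path(root) / rel
        if not path.is_file():
            problems.append(f"missing: {rel}")
        elif path.stat().st_size < MIN_BYTES:
            problems.append(f"too small ({path.stat().st_size} bytes): {rel}")
    return problems
```

Calling `check_pretrained()` from the project root and printing the result gives a quick list of anything that still needs to be downloaded again.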
Necessary pre-processing:
- Separate vocals and accompaniment with UVR (skip if there is no accompaniment).
- Cut the audio into shorter clips with slicer; whisper takes input shorter than 30 seconds.
- Manually check the generated clips and remove any that are shorter than 2 seconds or contain obvious noise.
- Adjust loudness if necessary; Adobe Audition is recommended.
- Put the dataset into the `dataset_raw` directory following the structure below.
dataset_raw
├───speaker0
│   ├───000001.wav
│   ├───...
│   └───000xxx.wav
└───speaker1
    ├───000001.wav
    ├───...
    └───000xxx.wav
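The length limits mentioned in the pre-processing list (at least 2 seconds, under 30 seconds) can be checked automatically before preprocessing. The sketch below assumes the `dataset_raw/<speaker>/<clip>.wav` layout shown above and plain PCM WAV files, since it relies on the standard-library `wave` module; the function names are hypothetical.

```python
import wave
from pathlib import Path


def clip_seconds(path):
    """Duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def find_bad_clips(root="dataset_raw", min_sec=2.0, max_sec=30.0):
    """Return (path, seconds) for every clip shorter than min_sec or not shorter than max_sec."""
    bad = []
    for path in sorted(Path(root).glob("*/*.wav")):
        seconds = clip_seconds(path)
        if seconds < min_sec or seconds >= max_sec:
            bad.append((path, seconds))
    return bad
```

This only flags clips by duration; checking for obvious noise still has to be done by ear.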
python svc_preprocessing.py -t 2
`-t`: number of threads; it should not exceed the CPU core count, and 2 is usually enough.
After preprocessing, you will get output with the following structure.
data_svc/
└── waves-16k
│ └── speaker0
│ │ ├── 000001.wav
│ │ └── 000xxx.wav
│ └── speaker1
│ ├── 000001.wav
│ └── 000xxx.wav
└── waves-32k
│ └── speaker0
│ │ ├── 000001.wav
│ │ └── 000xxx.wav
│ └── speaker1
│ ├── 000001.wav
│ └── 000xxx.wav
└── pitch
│ └── speaker0
│ │ ├── 000001.pit.npy
│ │ └── 000xxx.pit.npy
│ └── speaker1
│ ├── 000001.pit.npy
│ └── 000xxx.pit.npy
└── hubert
│ └── speaker0
│ │ ├── 000001.vec.npy
│ │ └── 000xxx.vec.npy
│ └── speaker1
│ ├── 000001.vec.npy
│ └── 000xxx.vec.npy
└── whisper
│ └── speaker0
│ │ ├── 000001.ppg.npy
│ │ └── 000xxx.ppg.npy
│ └── speaker1
│ ├── 000001.ppg.npy
│ └── 000xxx.ppg.npy
└── speaker
│ └── speaker0
│ │ ├── 000001.spk.npy
│ │ └── 000xxx.spk.npy
│ └── speaker1
│ ├── 000001.spk.npy
│ └── 000xxx.spk.npy
└── singer
│ ├── speaker0.spk.npy
│ └── speaker1.spk.npy
|
└── indexes
├── speaker0
│ ├── some_prefix_hubert.index
│ └── some_prefix_whisper.index
└── speaker1
├── hubert.index
└── whisper.index
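Given the layout above, a short script can confirm that every 16k clip has its pitch, hubert, whisper, and speaker features. This is a sketch based only on the folder names and file suffixes shown in the tree; the function name is hypothetical.

```python
from pathlib import Path

# Feature folder -> file suffix, as shown in the directory tree above.
FEATURES = {
    "pitch": ".pit.npy",
    "hubert": ".vec.npy",
    "whisper": ".ppg.npy",
    "speaker": ".spk.npy",
}


def missing_features(data_root="data_svc"):
    """List the feature files that are expected for each 16k clip but do not exist."""
    root = Path(data_root)
    missing = []
    for wav in sorted((root / "waves-16k").glob("*/*.wav")):
        speaker, stem = wav.parent.name, wav.stem
        for folder, suffix in FEATURES.items():
            expected = root / folder / speaker / (stem + suffix)
            if not expected.is_file():
                missing.append(expected.relative_to(root).as_posix())
    return missing
```

An empty result means every clip in `waves-16k` has all four feature files.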
- Re-sampling
  - Generate audio with a sampling rate of 16000Hz in `./data_svc/waves-16k`
    python prepare/preprocess_a.py -w ./dataset_raw -o ./data_svc/waves-16k -s 16000
  - Generate audio with a sampling rate of 32000Hz in `./data_svc/waves-32k`
    python prepare/preprocess_a.py -w ./dataset_raw -o ./data_svc/waves-32k -s 32000
- Use 16k audio to extract pitch
  python prepare/preprocess_crepe.py -w data_svc/waves-16k/ -p data_svc/pitch
- Use 16k audio to extract ppg
  python prepare/preprocess_ppg.py -w data_svc/waves-16k/ -p data_svc/whisper
- Use 16k audio to extract hubert
  python prepare/preprocess_hubert.py -w data_svc/waves-16k/ -v data_svc/hubert
- Use 16k audio to extract timbre code
  python prepare/preprocess_speaker.py data_svc/waves-16k/ data_svc/speaker
- Extract the average timbre code for inference; it can also replace the per-clip timbre when generating the training index, serving as the speaker's unified timbre for training
  python prepare/preprocess_speaker_ave.py data_svc/speaker/ data_svc/singer
- Use 32k audio to extract the linear spectrum
  python prepare/preprocess_spec.py -w data_svc/waves-32k/ -s data_svc/specs
- Use 32k audio to generate training index
  python prepare/preprocess_train.py
- Training file debugging
  python prepare/preprocess_zzz.py
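The step-by-step commands above can also be chained from one script, which is convenient when a single step has to be re-run. The sketch below only reuses the commands listed above; the helper names and the `dry_run` switch are assumptions for illustration.

```python
import subprocess
import sys

# Script and arguments for each manual step, in the order given above.
STEPS = [
    ["prepare/preprocess_a.py", "-w", "./dataset_raw", "-o", "./data_svc/waves-16k", "-s", "16000"],
    ["prepare/preprocess_a.py", "-w", "./dataset_raw", "-o", "./data_svc/waves-32k", "-s", "32000"],
    ["prepare/preprocess_crepe.py", "-w", "data_svc/waves-16k/", "-p", "data_svc/pitch"],
    ["prepare/preprocess_ppg.py", "-w", "data_svc/waves-16k/", "-p", "data_svc/whisper"],
    ["prepare/preprocess_hubert.py", "-w", "data_svc/waves-16k/", "-v", "data_svc/hubert"],
    ["prepare/preprocess_speaker.py", "data_svc/waves-16k/", "data_svc/speaker"],
    ["prepare/preprocess_speaker_ave.py", "data_svc/speaker/", "data_svc/singer"],
    ["prepare/preprocess_spec.py", "-w", "data_svc/waves-32k/", "-s", "data_svc/specs"],
    ["prepare/preprocess_train.py"],
    ["prepare/preprocess_zzz.py"],
]


def build_commands(python=sys.executable):
    """Return the full argument list for every step."""
    return [[python, *step] for step in STEPS]


def run_all(dry_run=True):
    """Print each command and, unless dry_run is set, execute it and stop on the first failure."""
    for cmd in build_commands():
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
```

Calling `run_all(dry_run=False)` from the project root would execute the steps in order; the default only prints them.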
- If fine-tuning from the pre-trained model, download sovits5.0.pretrain.pth, put it under the project root (at the path shown below), and set this line in `configs/base.yaml`:
  pretrain: "./vits_pretrain/sovits5.0.pretrain.pth"
  Also adjust the learning rate appropriately, e.g. 5e-5.
  `batch_size`: for a GPU with 6GB VRAM, 6 is the recommended value; 8 will work, but each step will be much slower.
- Start training
  python svc_trainer.py -c configs/base.yaml -n sovits5.0
- Resume training
  python svc_trainer.py -c configs/base.yaml -n sovits5.0 -p chkpt/sovits5.0/sovits5.0_***.pt
- Log visualization
  tensorboard --logdir logs/
- Export inference model: text encoder, Flow network, Decoder network
  python svc_export.py --config configs/base.yaml --checkpoint_path chkpt/sovits5.0/***.pt
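For the resume and export steps above, the `***` in the checkpoint name has to be filled in by hand. The sketch below picks the newest checkpoint automatically, under the assumption that `***` is a numeric counter (for example an epoch or step number); that naming detail and the function name are assumptions, not something the commands above confirm.

```python
import re
from pathlib import Path


def latest_checkpoint(chkpt_dir="chkpt/sovits5.0", name="sovits5.0"):
    """Return the checkpoint with the highest numeric suffix, or None if there is none."""
    pattern = re.compile(re.escape(name) + r"_(\d+)\.pt$")
    best = None
    for path in Path(chkpt_dir).glob(f"{name}_*.pt"):
        match = pattern.match(path.name)
        if match is None:
            continue  # skip files whose suffix is not purely numeric
        step = int(match.group(1))
        if best is None or step > best[0]:
            best = (step, path)
    return best[1] if best else None
```

Comparing the suffix as an integer avoids the trap of string ordering, where `sovits5.0_900.pt` would sort after `sovits5.0_1000.pt`.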
- Inference
  - If there is no need to adjust `f0`, just run the following command.
    python svc_inference.py --config configs/base.yaml --model sovits5.0.pth --spk ./data_svc/singer/your_singer.spk.npy --wave test.wav --shift 0
  - If `f0` will be adjusted manually, follow these steps:
    - Use whisper to extract the content encoding and generate `test.ppg.npy`.
      python whisper/inference.py -w test.wav -p test.ppg.npy
    - Use hubert to extract the content vector and generate `test.vec.npy`; this is done separately rather than through one-click inference in order to reduce GPU memory usage.
      python hubert/inference.py -w test.wav -v test.vec.npy
    - Extract the F0 parameter to CSV text format, open the CSV file in Excel, and manually correct wrong F0 values with reference to Audition or SonicVisualiser.
      python pitch/inference.py -w test.wav -p test.csv
    - Final inference
      python svc_inference.py --config configs/base.yaml --model sovits5.0.pth --spk ./data_svc/singer/your_singer.spk.npy --wave test.wav --ppg test.ppg.npy --vec test.vec.npy --pit test.csv --shift 0
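The manual F0 correction described above can also be scripted, for example to fix an octave error over a range of frames. The sketch below assumes each CSV row has two columns, a time label followed by an integer F0 in Hz, with 0 marking unvoiced frames; the actual layout written by `pitch/inference.py` should be checked first, and the function names are hypothetical.

```python
import csv


def shift_rows(rows, first, last, semitones):
    """Scale F0 in rows first..last (inclusive) by the given number of semitones."""
    factor = 2.0 ** (semitones / 12.0)
    out = []
    for index, (label, f0) in enumerate(rows):
        value = float(f0)
        if first <= index <= last and value > 0:  # leave unvoiced frames (F0 = 0) alone
            value *= factor
        out.append([label, str(int(round(value)))])
    return out


def edit_csv(src, dst, first, last, semitones):
    """Read src, shift the selected rows, and write the result to dst."""
    with open(src, newline="", encoding="utf-8") as handle:
        rows = list(csv.reader(handle))
    with open(dst, "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(shift_rows(rows, first, last, semitones))
```

For instance, `shift_rows(rows, 100, 180, -12)` would halve the F0 of rows 100 to 180, which is the usual fix when the extractor jumps one octave too high.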
- Notes
  - When `--ppg` is specified, the audio content encoding is not extracted again when the same audio is inferred multiple times; if it is not specified, it is extracted automatically.
  - When `--vec` is specified, the audio content vector is not extracted again when the same audio is inferred multiple times; if it is not specified, it is extracted automatically.
  - When `--pit` is specified, the manually tuned F0 parameter is loaded; if it is not specified, F0 is extracted automatically.
  - The output file `svc_out.wav` is generated in the current directory.
- Arguments ref

args | --config | --model | --spk | --wave | --ppg | --vec | --pit | --shift |
---|---|---|---|---|---|---|---|---|
name | config path | model path | speaker | wave input | wave ppg | wave hubert | wave pitch | pitch shift |

- Post-processing by VAD
  python svc_inference_post.py --ref test.wav --svc svc_out.wav --out svc_out_post.wav
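Since `--ppg`, `--vec`, and `--pit` are optional caches, a small wrapper can add them only when the files already exist. The sketch below builds the `svc_inference.py` command from the arguments listed above; the helper name and the convention of storing `test.ppg.npy`, `test.vec.npy`, and `test.csv` next to `test.wav` are assumptions that follow the examples in this document.

```python
import sys
from pathlib import Path


def build_inference_cmd(wave, spk, model="sovits5.0.pth", config="configs/base.yaml", shift=0):
    """Build the inference command, reusing cached feature files that sit next to the wave."""
    cmd = [
        sys.executable, "svc_inference.py",
        "--config", config,
        "--model", model,
        "--spk", spk,
        "--wave", str(wave),
        "--shift", str(shift),
    ]
    stem = Path(wave).with_suffix("")  # e.g. "test.wav" -> "test"
    for flag, suffix in (("--ppg", ".ppg.npy"), ("--vec", ".vec.npy"), ("--pit", ".csv")):
        cached = Path(str(stem) + suffix)
        if cached.is_file():
            cmd += [flag, str(cached)]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the project root.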
To increase the stability of the generated timbre, you can use the method described in the Retrieval-based-Voice-Conversion repository. This method consists of 2 steps:
- Train the retrieval index on hubert and whisper features. Run training with default settings:
  python svc_train_retrieval.py
  If the number of vectors is more than 200_000, they will be compressed to 10_000 using the MiniBatchKMeans algorithm. You can change these settings using command line options:
  usage: crate faiss indexes for feature retrieval [-h] [--debug] [--prefix PREFIX] [--speakers SPEAKERS [SPEAKERS ...]]
         [--compress-features-after COMPRESS_FEATURES_AFTER] [--n-clusters N_CLUSTERS] [--n-parallel N_PARALLEL]
  options:
    -h, --help            show this help message and exit
    --debug
    --prefix PREFIX       add prefix to index filename
    --speakers SPEAKERS [SPEAKERS ...]
                          speaker names to create an index. By default all speakers are from data_svc
    --compress-features-after COMPRESS_FEATURES_AFTER
                          If the number of features is greater than the value compress feature vectors using MiniBatchKMeans.
    --n-clusters N_CLUSTERS
                          Number of centroids to which features will be compressed
    --n-parallel N_PARALLEL
                          Nuber of parallel job of MinibatchKmeans. Default is cpus-1
  Compressing the training vectors can speed up index inference, but it reduces retrieval quality. Compress the vectors only if you really have a lot of them.
  The resulting indexes will be stored in the "indexes" folder as:
  data_svc
  ...
  └── indexes
      ├── speaker0
      │   ├── some_prefix_hubert.index
      │   └── some_prefix_whisper.index
      └── speaker1
          ├── hubert.index
          └── whisper.index
- At the inference stage, the n closest features are added in a certain proportion for the vits model. Enable feature retrieval with these settings:
  python svc_inference.py --config configs/base.yaml --model sovits5.0.pth --spk ./data_svc/singer/your_singer.spk.npy --wave test.wav --shift 0 \
      --enable-retrieval \
      --retrieval-ratio 0.5 \
      --n-retrieval-vectors 3
  For a better retrieval effect, you can try cycling through different values of `--retrieval-ratio` and `--n-retrieval-vectors`.
  If you have multiple sets of indexes, you can select a specific set via `--retrieval-index-prefix`.
  You can explicitly specify the paths to the hubert and whisper indexes using `--hubert-index-path` and `--whisper-index-path`.
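To make the roles of `--retrieval-ratio` and `--n-retrieval-vectors` concrete, the sketch below blends one feature vector with the mean of its n nearest neighbours from an index. It is only an illustration of the idea: a brute-force search over Python lists stands in for the faiss index, the neighbours are averaged with equal weight, and the function names are hypothetical, so the project's actual retrieval code may differ.

```python
import math


def nearest(index_vectors, query, n):
    """Return the n index vectors closest to query by Euclidean distance."""
    return sorted(index_vectors, key=lambda vector: math.dist(vector, query))[:n]


def blend(query, index_vectors, ratio=0.5, n=3):
    """Mix query with the mean of its n nearest neighbours.

    ratio = 0 returns the query unchanged; ratio = 1 returns the neighbour mean.
    """
    neighbours = nearest(index_vectors, query, n)
    mean = [sum(column) / len(neighbours) for column in zip(*neighbours)]
    return [ratio * m + (1.0 - ratio) * q for m, q in zip(mean, query)]
```

A higher ratio pulls the converted features closer to what was seen in the target speaker's training data, while a larger n smooths over individual neighbours.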
Named by pure coincidence: average -> ave -> eva; eve (eva) represents conception and reproduction.
python svc_eva.py
eva_conf = {
'./configs/singers/singer0022.npy': 0,
'./configs/singers/singer0030.npy': 0,
'./configs/singers/singer0047.npy': 0.5,
'./configs/singers/singer0051.npy': 0.5,
}
The generated singer file will be `eva.spk.npy`.
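The weights in `eva_conf` suggest that the mixed speaker is a weighted sum of the listed speaker embeddings. The sketch below shows that arithmetic on plain Python sequences; treating the mix as a weighted sum is an assumption about what `svc_eva.py` does, and the function name is hypothetical.

```python
def mix_speakers(weighted_embeddings):
    """Weighted sum of equally sized embeddings given as (embedding, weight) pairs."""
    pairs = list(weighted_embeddings)
    if not pairs:
        raise ValueError("at least one embedding is required")
    size = len(pairs[0][0])
    mixed = [0.0] * size
    for embedding, weight in pairs:
        if len(embedding) != size:
            raise ValueError("all embeddings must have the same length")
        for i, value in enumerate(embedding):
            mixed[i] += weight * value
    return mixed
```

With NumPy installed, each embedding could be read with `np.load(path)` for the paths in `eva_conf` and the result written with `np.save("eva.spk.npy", ...)`. With the example configuration above, only singer0047 and singer0051 contribute, at half weight each.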
https://github.com/facebookresearch/speech-resynthesis paper
https://github.com/jaywalnut310/vits paper
https://github.com/openai/whisper/ paper
https://github.com/NVIDIA/BigVGAN paper
https://github.com/mindslab-ai/univnet paper
https://github.com/nii-yamagishilab/project-NN-Pytorch-scripts/tree/master/project/01-nsf
https://github.com/huawei-noah/Speech-Backbones/tree/main/Grad-TTS
https://github.com/brentspell/hifi-gan-bwe
https://github.com/mozilla/TTS
https://github.com/bshall/soft-vc
https://github.com/maxrmorrison/torchcrepe
https://github.com/MoonInTheRiver/DiffSinger
https://github.com/OlaWod/FreeVC paper
https://github.com/yl4579/HiFTNet paper
Autoregressive neural f0 model for statistical parametric speech synthesis
Adapter-Based Extension of Multi-Speaker Text-to-Speech Model for New Speakers
AdaSpeech: Adaptive Text to Speech for Custom Voice
AdaVITS: Tiny VITS for Low Computing Resource Speaker Adaptation
Cross-Speaker Prosody Transfer on Any Text for Expressive Speech Synthesis
Multilingual Speech Synthesis and Cross-Language Voice Cloning: GRL
RoFormer: Enhanced Transformer with rotary position embedding
https://github.com/auspicious3000/contentvec/blob/main/contentvec/data/audio/audio_utils_1.py
https://github.com/revsic/torch-nansy/blob/main/utils/augment/praat.py
https://github.com/revsic/torch-nansy/blob/main/utils/augment/peq.py
https://github.com/biggytruck/SpeechSplit2/blob/main/utils.py
https://github.com/OlaWod/FreeVC/blob/main/preprocess_sr.py
https://github.com/Francis-Komizu/Sovits
2022.04.12 https://mp.weixin.qq.com/s/autNBYCsG4_SvWt2-Ll_zA
2022.04.22 https://github.com/PlayVoice/VI-SVS
2022.07.26 https://mp.weixin.qq.com/s/qC4TJy-4EVdbpvK2cQb1TA
2022.09.08 https://github.com/PlayVoice/VI-SVC