An implementation of Microsoft's "FastSpeech 2: Fast and High-Quality End-to-End Text to Speech"

cadia-lvl/FastSpeech2

 
 

FastSpeech 2 - PyTorch Implementation

This is a fork of the original repository, used to train and run a model for Icelandic. For instructions on how to run the Icelandic version, see README-is.md.

This is a PyTorch implementation of Microsoft's text-to-speech system FastSpeech 2: Fast and High-Quality End-to-End Text to Speech. This project is based on xcmyz's implementation of FastSpeech. Feel free to use/modify the code. Any suggestion for improvement is appreciated.

So far, this repository contains only FastSpeech 2, not FastSpeech 2s. I will update it once I successfully reproduce FastSpeech 2s, the end-to-end version of FastSpeech 2.

Audio Samples

Audio samples generated by this implementation can be found here.

Quickstart

Dependencies

You can install the Python dependencies with

pip3 install -r requirements.txt

Synthesis

You have to download our FastSpeech2 pretrained model and put it in the ckpt/LJSpeech/ directory.

You can run

python3 synthesis.py --step 300000

to generate any desired utterances. The generated utterances will be put in the results/ directory.

Here is a generated spectrogram of the sentence "Printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the Exhibition"

For CPU inference, please refer to this Colab tutorial. Because of how the MelGAN code is structured, you have to clone the original MelGAN repository instead of loading it through torch.hub.

Controllability

The duration, pitch, and energy of the synthesized utterances can be modified by specifying a ratio that is applied to the corresponding predicted values. For example, the following command shortens the predicted durations by 20% (a faster speaking rate) and lowers the predicted energy (volume) by 20%:

python3 synthesis.py --step 300000 --duration_control 0.8 --energy_control 0.8
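As a rough illustration of what a duration ratio does, the sketch below scales per-phoneme frame counts and then expands the phoneme sequence to frame level. It is a minimal sketch, not the code in synthesis.py: the function names, the rounding rule, and the example values are assumptions made for illustration.

```python
def apply_duration_control(durations, duration_control=1.0):
    """Scale per-phoneme durations (in frames) by a ratio and round to whole frames."""
    # Assumption: simple rounding; the repository may round differently.
    return [max(0, round(d * duration_control)) for d in durations]


def length_regulate(phonemes, durations):
    """Repeat each phoneme once for every frame assigned to it."""
    frames = []
    for phoneme, n_frames in zip(phonemes, durations):
        frames.extend([phoneme] * n_frames)
    return frames


phonemes = ["HH", "AH", "L", "OW"]  # illustrative phoneme sequence
predicted = [5, 10, 5, 20]          # illustrative predicted durations, in frames

scaled = apply_duration_control(predicted, 0.8)
print(scaled)                                    # [4, 8, 4, 16]
print(len(length_regulate(phonemes, predicted)))  # 40
print(len(length_regulate(phonemes, scaled)))     # 32
```

With a ratio of 0.8 the utterance keeps 80% of its frames, so the same phonemes are spoken in less time. Pitch and energy ratios can be thought of the same way, as multipliers on the predicted values.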

Docker

This code is configured to run in a Docker container, either with CUDA support or on CPU only.

First build the image

docker build -f runtime.Dockerfile --tag fastspeech2:latest .

Then run with

docker run -v /full/path/to/models:/opt/fastspeech2/models -v /full/path/to/results:/opt/fastspeech2/results fastspeech2:latest

Training

Datasets

We use the LJSpeech English dataset, which consists of 13100 short audio clips of a single female speaker reading passages from 7 non-fiction books, approximately 24 hours in total, to train the entire model end-to-end.

After downloading the dataset and extracting the compressed files, you have to modify the hp.data_path and some other parameters in hparams.py. Default parameters are for the LJSpeech dataset.

Preprocessing

As described in the paper, the Montreal Forced Aligner (MFA) is used to obtain the alignments between the utterances and the phoneme sequences. Alignments for the LJSpeech dataset are provided here. Put the TextGrid.zip file in your hp.preprocessed_path/ and extract it before you continue.

After that, run the preprocessing script by

python3 preprocess.py

Alternatively, you can align the corpus yourself. First download the MFA package and the pretrained lexicon file. (We use the LibriSpeech lexicon instead of the G2p_en Python package proposed in the paper.)

wget https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner/releases/download/v1.1.0-beta.2/montreal-forced-aligner_linux.tar.gz
tar -zxvf montreal-forced-aligner_linux.tar.gz

wget http://www.openslr.org/resources/11/librispeech-lexicon.txt -O montreal-forced-aligner/pretrained_models/librispeech-lexicon.txt

Then prepare some necessary files required by the MFA.

python3 prepare_align.py

Run the MFA and put the .TextGrid files in your hp.preprocessed_path.

# Replace $YOUR_DATA_PATH and $YOUR_PREPROCESSED_PATH with ./LJSpeech-1.1/wavs and ./preprocessed/LJSpeech/TextGrid, for example
./montreal-forced-aligner/bin/mfa_align $YOUR_DATA_PATH montreal-forced-aligner/pretrained_models/librispeech-lexicon.txt english $YOUR_PREPROCESSED_PATH -j 8

And remember to run the preprocessing script.

python3 preprocess.py

After preprocessing, you will get a stat.txt file in your hp.preprocessed_path/ that records the minimum and maximum fundamental frequency (f0) and energy values over the entire corpus. Update the f0 and energy parameters in hparams.py to match the contents of stat.txt.
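To make the meaning of these corpus-wide statistics concrete, here is a minimal sketch of how a minimum and maximum could be gathered across utterances. It does not reproduce preprocess.py or the format of stat.txt; the function name and the sample values are hypothetical.

```python
def corpus_min_max(per_utterance_values):
    """Return (min, max) over every value of every utterance."""
    lo, hi = float("inf"), float("-inf")
    for values in per_utterance_values:
        for v in values:
            lo = min(lo, v)
            hi = max(hi, v)
    return lo, hi


# Hypothetical per-frame f0 values (Hz) for three utterances.
f0_per_utterance = [
    [180.0, 210.5, 195.0],
    [172.25, 230.0],
    [201.0, 188.5, 240.75],
]

f0_min, f0_max = corpus_min_max(f0_per_utterance)
print(f0_min, f0_max)  # 172.25 240.75
```

The same helper would apply to per-frame energy values. Whether unvoiced frames are excluded before taking the f0 minimum is a preprocessing decision this sketch does not address.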

Training

Train your model with

python3 train.py

The model needs fewer than 10k training steps (less than 1 hour on my GTX1080 GPU) to generate audio samples of acceptable quality, which is much more efficient than autoregressive models such as Tacotron2.

There is likely room for improvement in this repository. For example, I simply sum the duration loss, f0 loss, energy loss and mel loss without any weighting.
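The difference between an unweighted sum and a weighted one can be sketched as follows. This is an illustration only: the function name, the dictionary keys, and the loss values are hypothetical and are not taken from train.py.

```python
def total_loss(losses, weights=None):
    """Combine named loss terms; any term without a weight counts with weight 1.0."""
    weights = weights or {}
    return sum(weights.get(name, 1.0) * value for name, value in losses.items())


# Hypothetical per-step loss values.
losses = {"mel": 0.5, "duration": 0.25, "f0": 0.125, "energy": 0.125}

print(total_loss(losses))                # 1.0  (plain sum, no weighting)
print(total_loss(losses, {"mel": 2.0}))  # 1.5  (mel term counted twice as much)
```

With no weights the result is the plain sum described above; passing a weight per term is one simple way to experiment with rebalancing the objectives.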

TensorBoard

The TensorBoard logs are stored in the log/hp.dataset/ directory. Use

tensorboard --logdir log/hp.dataset/

to serve TensorBoard on your localhost. Here is an example from training the model on LJSpeech for 400k steps.

Notes

Implementation Issues

There are several differences between my implementation and the paper.

  • The paper includes punctuation in the transcripts. However, MFA discards punctuation by default and I haven't found a way around this. During inference, I replace all punctuation with the sp (short-pause) phone label.
  • Following xcmyz's implementation, I use an additional Tacotron-2-style postnet after the FastSpeech decoder, which is not used in the original paper.
  • The Transformer paper suggests applying dropout after summing the input and positional embeddings. I found that this makes no observable difference, so I do not use dropout on the positional embedding.
  • The paper suggests using L1 loss for the mel loss and L2 loss for the variance predictor losses, but I find it easier to train the model with an L2 mel loss and L1 variance adaptor losses.
  • Gradient clipping is used in the training.
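The punctuation handling from the first item can be pictured with the sketch below, which replaces runs of punctuation with an sp token. It is a word-level simplification under stated assumptions: the actual inference code works on phone labels, and the punctuation set and function name here are hypothetical.

```python
import re

# Assumption: this set of punctuation marks is illustrative, not the repository's.
PUNCTUATION = re.compile(r"[,.;:!?]+")


def mark_pauses(text, pause_token="sp"):
    """Replace each run of punctuation with a standalone pause token."""
    spaced = PUNCTUATION.sub(f" {pause_token} ", text)
    return spaced.split()


print(mark_pauses("Printing, in the only sense."))
# ['Printing', 'sp', 'in', 'the', 'only', 'sense', 'sp']
```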

Some tips for training this model.

  • You can set the hp.acc_steps parameter (gradient accumulation steps) if you wish to train with a large batch size on a GPU with limited memory.
  • In my experience, carefully masking out the padded positions in the loss computation and in the model's forward pass greatly improves performance.
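The effect of masking padded positions in the loss can be seen in a small, framework-free sketch. It assumes right-padded sequences with known lengths; the function name and the numbers are hypothetical, and the repository's own PyTorch code is not reproduced here.

```python
def masked_l1_loss(predictions, targets, lengths):
    """Mean absolute error over the valid (non-padded) positions only."""
    total, count = 0.0, 0
    for pred_row, target_row, length in zip(predictions, targets, lengths):
        # Only the first `length` positions of each row are real data.
        for pred, target in zip(pred_row[:length], target_row[:length]):
            total += abs(pred - target)
            count += 1
    return total / count if count else 0.0


# Two sequences padded to length 4; the second has only 2 real frames.
predictions = [[1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 9.0, 9.0]]
targets = [[1.0, 2.0, 3.0, 2.0], [1.0, 2.0, 0.0, 0.0]]
lengths = [4, 2]

print(masked_l1_loss(predictions, targets, lengths))  # 0.5
print(masked_l1_loss(predictions, targets, [4, 4]))   # 2.625 (padding counted as data)
```

In the second call the padded frames are treated as real targets, and whatever the model outputs there dominates the loss. Restricting the average to valid positions removes that spurious training signal.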

Please let me know if you find any mistakes in this repo, or have any useful tips for training the FastSpeech2 model.

TODO

  • Try different weights for the loss terms.
  • Evaluate the quality of the synthesized audio over the validation set.
  • Multi-speaker, voice cloning, or transfer learning experiment.
  • Implement FastSpeech 2s.
