Commit
Merge remote-tracking branch 'upstream/master'
apls777 committed Apr 13, 2019
2 parents e52c53e + ab5cb08 commit d8f7c18
Showing 32 changed files with 2,644 additions and 1,034 deletions.
56 changes: 49 additions & 7 deletions README.md
@@ -1,6 +1,12 @@
# Tacotron-2:
Tensorflow implementation of DeepMind's Tacotron-2, a deep neural network architecture described in this paper: [Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions](https://arxiv.org/pdf/1712.05884.pdf)

This repository contains additional improvements and experiments beyond the paper. We therefore provide a **paper_hparams.py** file, which holds the exact hyperparameters needed to reproduce the paper's results without any extras.

The suggested **hparams.py** file, used by default, contains those hyperparameters plus extras that proved to give better results in most cases. Feel free to adjust the parameters as needed.

DIFFERENCES WILL BE HIGHLIGHTED IN DOCUMENTATION SHORTLY.


# Repository Structure:
Tacotron-2
@@ -20,14 +26,25 @@ Tensorflow implementation of DeepMind's Tacotron-2. A deep neural network archit
│  │  └── wavs
│   ├── mel-spectrograms
│   ├── plots
│   ├── pretrained
│   ├── taco_pretrained
│   ├── metas
│   └── wavs
├── logs-Wavenet (4)
│   ├── eval-dir
│   │  ├── plots
│  │  └── wavs
│   ├── plots
│   ├── pretrained
│   ├── wave_pretrained
│   ├── metas
│   └── wavs
├── logs-Tacotron-2 ( * )
│   ├── eval-dir
│   │  ├── plots
│  │  └── wavs
│   ├── plots
│   ├── taco_pretrained
│   ├── wave_pretrained
│   ├── metas
│   └── wavs
├── papers
├── tacotron
@@ -60,6 +77,8 @@ The previous tree shows the current state of the repository (separate training,
- Step **(4)**: Train your Wavenet model. Yields the **logs-Wavenet** folder.
- Step **(5)**: Synthesize audio using the Wavenet model. Gives the **wavenet_output** folder.

- Note: Steps 2, 3, and 4 can be done in a single run that trains both Tacotron and WaveNet (Tacotron-2, step ( * )); see the example below.
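
For reference, the joint run can be launched with a single training command (a sketch; it assumes the `--model` argument of train.py used elsewhere in this README, with 'Tacotron-2' selecting the joint pipeline):

> python train.py --model='Tacotron-2'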


Note:
- **Our preprocessing only supports Ljspeech and Ljspeech-like datasets (M-AILABS speech data)!** If running on datasets stored differently, you will probably need to make your own preprocessing script.
@@ -87,12 +106,33 @@ To have an overview of our advance on this project, please refer to [this discus
Since the two parts of the global model are trained separately, we can start by training the feature prediction model and use its predictions later during the WaveNet training; a possible command sequence is sketched below.
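
For the separate route, a plausible sequence looks like this sketch (it assumes the train.py and synthesize.py interfaces shown later in this README; adjust the flags to your setup):

> python train.py --model='Tacotron'

> python synthesize.py --model='Tacotron' --mode='synthesis' --GTA=True

> python train.py --model='WaveNet'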

# How to start
first, you need to have python 3 installed along with [Tensorflow](https://www.tensorflow.org/install/).
- **Machine Setup:**

First, you need to have python 3 installed along with [Tensorflow](https://www.tensorflow.org/install/).

next you can install the requirements. If you are an Anaconda user: (else replace **pip** with **pip3** and **python** with **python3**)
Next, you need to install some Linux dependencies to ensure audio libraries work properly:

> apt-get install -y libasound-dev portaudio19-dev libportaudio2 libportaudiocpp0 ffmpeg libav-tools
Finally, you can install the requirements. If you are an Anaconda user (otherwise replace **pip** with **pip3** and **python** with **python3**):

> pip install -r requirements.txt
- **Docker:**

Alternatively, you can build the **Docker image** to ensure everything is set up automatically and use the project inside Docker containers.
**The Dockerfile is inside the "docker" folder.**

The Docker image can be built with:

> docker build -t tacotron-2_image docker/
Containers can then be run with:

> docker run -i --name new_container tacotron-2_image
Please report any issues with Docker usage of our models; I'll get to it. Thanks!

# Dataset:
We tested the code above on the [LJSpeech dataset](https://keithito.com/LJ-Speech-Dataset/), which has almost 24 hours of labeled recordings of a single female speaker. (Further information on the dataset is available in the README file included with the download.)
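
For reference, a minimal download-and-extract sketch (the archive URL and file name are assumptions based on the dataset page linked above; adjust them as needed and place the extracted folder where your preprocessing expects it):

> wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2

> tar -xjf LJSpeech-1.1.tar.bz2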

@@ -105,6 +145,8 @@ Before proceeding, you must pick the hyperparameters that best suit your needs.

To pick optimal FFT parameters, I have made a **griffin_lim_synthesis_tool** notebook that you can use to invert real extracted mel/linear spectrograms and judge how good your preprocessing is. All other options are well explained in **hparams.py** and have meaningful names, so you can experiment with them freely.
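
As a quick illustration of what the notebook does, here is a minimal round-trip sketch using this repository's datasets/audio.py and hparams.py (the wav path is a placeholder, and audio.load_wav is assumed to take a path and a sample rate, as used by the preprocessor):

```python
from datasets import audio
from hparams import hparams

# Load any test utterance at the configured sample rate (placeholder path).
wav = audio.load_wav('LJ001-0001.wav', hparams.sample_rate)

# Extract a mel spectrogram with the current hparams, then invert it with Griffin-Lim.
mel = audio.melspectrogram(wav, hparams)               # shape: [num_mels, frames]
reconstructed = audio.inv_mel_spectrogram(mel, hparams)

# Listen to the result to judge the chosen fft/hop/window settings.
audio.save_wav(reconstructed, 'griffin_lim_check.wav', hparams.sample_rate)
```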

AWAIT DOCUMENTATION ON HPARAMS SHORTLY!!

# Preprocessing
Before running the following steps, please make sure you are inside the **Tacotron-2** folder.
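
Preprocessing can then be started with something like the following (an illustrative example only, assuming preprocess.py's defaults are set up for the LJSpeech layout):

> python preprocess.py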

@@ -158,16 +200,16 @@ For the spectrogram prediction network (separately), there are **three types** o

- **Evaluation** (synthesis on custom sentences). This is what we'll usually use after having a full end-to-end model.

> python synthesize.py --model='Tacotron' --mode='eval'
> python synthesize.py --model='Tacotron'
- **Natural synthesis** (let the model make predictions alone by feeding the last decoder output to the next time step).

> python synthesize.py --model='Tacotron' --GTA=False
> python synthesize.py --model='Tacotron' --mode='synthesis' --GTA=False

- **Ground Truth Aligned synthesis** (DEFAULT: the model is assisted by true labels in a teacher-forcing manner). This synthesis method is used when predicting the mel spectrograms used to train the WaveNet vocoder. (It yields better results, as stated in the paper.)

> python synthesize.py --model='Tacotron' --GTA=True
> python synthesize.py --model='Tacotron' --mode='synthesis' --GTA=True
Synthesizing the **waveforms** conditioned on previously synthesized mel spectrograms (separately) can be done with:
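
An illustrative invocation (a sketch, assuming the same synthesize.py --model interface used above):

> python synthesize.py --model='WaveNet'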

106 changes: 93 additions & 13 deletions datasets/audio.py
@@ -14,8 +14,10 @@ def save_wav(wav, path, sr):
#proposed by @dsmiller
wavfile.write(path, sr, wav.astype(np.int16))

def save_wavenet_wav(wav, path, sr):
librosa.output.write_wav(path, wav, sr=sr)
def save_wavenet_wav(wav, path, sr, inv_preemphasize, k):
# wav = inv_preemphasis(wav, k, inv_preemphasize)
wav *= 32767 / max(0.01, np.max(np.abs(wav)))
wavfile.write(path, sr, wav.astype(np.int16))

def preemphasis(wav, k, preemphasize=True):
if preemphasize:
@@ -57,16 +59,18 @@ def get_hop_size(hparams):
return hop_size

def linearspectrogram(wav, hparams):
D = _stft(preemphasis(wav, hparams.preemphasis, hparams.preemphasize), hparams)
S = _amp_to_db(np.abs(D), hparams) - hparams.ref_level_db
# D = _stft(preemphasis(wav, hparams.preemphasis, hparams.preemphasize), hparams)
D = _stft(wav, hparams)
S = _amp_to_db(np.abs(D)**hparams.magnitude_power, hparams) - hparams.ref_level_db

if hparams.signal_normalization:
return _normalize(S, hparams)
return S

def melspectrogram(wav, hparams):
D = _stft(preemphasis(wav, hparams.preemphasis, hparams.preemphasize), hparams)
S = _amp_to_db(_linear_to_mel(np.abs(D), hparams), hparams) - hparams.ref_level_db
# D = _stft(preemphasis(wav, hparams.preemphasis, hparams.preemphasize), hparams)
D = _stft(wav, hparams)
S = _amp_to_db(_linear_to_mel(np.abs(D)**hparams.magnitude_power, hparams), hparams) - hparams.ref_level_db

if hparams.signal_normalization:
return _normalize(S, hparams)
@@ -79,7 +83,7 @@ def inv_linear_spectrogram(linear_spectrogram, hparams):
else:
D = linear_spectrogram

S = _db_to_amp(D + hparams.ref_level_db) #Convert back to linear
S = _db_to_amp(D + hparams.ref_level_db)**(1/hparams.magnitude_power) #Convert back to linear

if hparams.use_lws:
processor = _lws_processor(hparams)
@@ -97,7 +101,7 @@ def inv_mel_spectrogram(mel_spectrogram, hparams):
else:
D = mel_spectrogram

S = _mel_to_linear(_db_to_amp(D + hparams.ref_level_db), hparams) # Convert back to linear
S = _mel_to_linear(_db_to_amp(D + hparams.ref_level_db)**(1/hparams.magnitude_power), hparams) # Convert back to linear

if hparams.use_lws:
processor = _lws_processor(hparams)
@@ -107,6 +111,39 @@ def inv_mel_spectrogram(mel_spectrogram, hparams):
else:
return inv_preemphasis(_griffin_lim(S ** hparams.power, hparams), hparams.preemphasis, hparams.preemphasize)

###########################################################################################
# tensorflow Griffin-Lim
# Thanks to @begeekmyfriend: https://github.com/begeekmyfriend/Tacotron-2/blob/mandarin-new/datasets/audio.py

def inv_linear_spectrogram_tensorflow(spectrogram, hparams):
    '''Builds computational graph to convert spectrogram to waveform using TensorFlow.
    Unlike inv_spectrogram, this does NOT invert the preemphasis. The caller should call
    inv_preemphasis on the output after running the graph.
    '''
    if hparams.signal_normalization:
        D = _denormalize_tensorflow(spectrogram, hparams)
    else:
        D = spectrogram

    S = tf.pow(_db_to_amp_tensorflow(D + hparams.ref_level_db), (1/hparams.magnitude_power))
    return _griffin_lim_tensorflow(tf.pow(S, hparams.power), hparams)

def inv_mel_spectrogram_tensorflow(mel_spectrogram, hparams):
    '''Builds computational graph to convert mel spectrogram to waveform using TensorFlow.
    Unlike inv_mel_spectrogram, this does NOT invert the preemphasis. The caller should call
    inv_preemphasis on the output after running the graph.
    '''
    if hparams.signal_normalization:
        D = _denormalize_tensorflow(mel_spectrogram, hparams)
    else:
        D = mel_spectrogram

    S = tf.pow(_db_to_amp_tensorflow(D + hparams.ref_level_db), (1/hparams.magnitude_power))
    S = _mel_to_linear_tensorflow(S, hparams)  # Convert back to linear
    return _griffin_lim_tensorflow(tf.pow(S, hparams.power), hparams)

###########################################################################################

def _lws_processor(hparams):
import lws
return lws.lws(hparams.n_fft, get_hop_size(hparams), fftsize=hparams.win_size, mode="speech")
@@ -123,11 +160,26 @@ def _griffin_lim(S, hparams):
y = _istft(S_complex * angles, hparams)
return y

def _griffin_lim_tensorflow(S, hparams):
    '''TensorFlow implementation of Griffin-Lim
    Based on https://github.com/Kyubyong/tensorflow-exercises/blob/master/Audio_Processing.ipynb
    '''
    with tf.variable_scope('griffinlim'):
        # TensorFlow's stft and istft operate on a batch of spectrograms; create batch of size 1
        S = tf.expand_dims(S, 0)
        S_complex = tf.identity(tf.cast(S, dtype=tf.complex64))
        y = tf.contrib.signal.inverse_stft(S_complex, hparams.win_size, get_hop_size(hparams), hparams.n_fft)
        for i in range(hparams.griffin_lim_iters):
            est = tf.contrib.signal.stft(y, hparams.win_size, get_hop_size(hparams), hparams.n_fft)
            angles = est / tf.cast(tf.maximum(1e-8, tf.abs(est)), tf.complex64)
            y = tf.contrib.signal.inverse_stft(S_complex * angles, hparams.win_size, get_hop_size(hparams), hparams.n_fft)
        return tf.squeeze(y, 0)

def _stft(y, hparams):
if hparams.use_lws:
return _lws_processor(hparams).stft(y).T
else:
return librosa.stft(y=y, n_fft=hparams.n_fft, hop_length=get_hop_size(hparams), win_length=hparams.win_size)
return librosa.stft(y=y, n_fft=hparams.n_fft, hop_length=get_hop_size(hparams), win_length=hparams.win_size, pad_mode='constant')

def _istft(y, hparams):
return librosa.istft(y, hop_length=get_hop_size(hparams), win_length=hparams.win_size)
@@ -155,11 +207,16 @@ def pad_lr(x, fsize, fshift):
return pad, pad + r
##########################################################
#Librosa correct padding
def librosa_pad_lr(x, fsize, fshift):
'''compute right padding (final frame)
def librosa_pad_lr(x, fsize, fshift, pad_sides=1):
'''compute right padding (final frame) or both sides padding (first and final frames)
'''
return int(fsize // 2)

assert pad_sides in (1, 2)
# return int(fsize // 2)
pad = (x.shape[0] // fshift + 1) * fshift - x.shape[0]
if pad_sides == 1:
return 0, pad
else:
return pad // 2, pad // 2 + pad % 2

# Conversions
_mel_basis = None
@@ -177,6 +234,12 @@ def _mel_to_linear(mel_spectrogram, hparams):
_inv_mel_basis = np.linalg.pinv(_build_mel_basis(hparams))
return np.maximum(1e-10, np.dot(_inv_mel_basis, mel_spectrogram))

def _mel_to_linear_tensorflow(mel_spectrogram, hparams):
    global _inv_mel_basis
    if _inv_mel_basis is None:
        _inv_mel_basis = np.linalg.pinv(_build_mel_basis(hparams))
    return tf.transpose(tf.maximum(1e-10, tf.matmul(tf.cast(_inv_mel_basis, tf.float32), tf.transpose(mel_spectrogram, [1, 0]))), [1, 0])

def _build_mel_basis(hparams):
assert hparams.fmax <= hparams.sample_rate // 2
return librosa.filters.mel(hparams.sample_rate, hparams.n_fft, n_mels=hparams.num_mels,
@@ -189,6 +252,9 @@ def _amp_to_db(x, hparams):
def _db_to_amp(x):
return np.power(10.0, (x) * 0.05)

def _db_to_amp_tensorflow(x):
    return tf.pow(tf.ones(tf.shape(x)) * 10.0, x * 0.05)

def _normalize(S, hparams):
if hparams.allow_clipping_in_normalization:
if hparams.symmetric_mels:
@@ -216,3 +282,17 @@ def _denormalize(D, hparams):
return (((D + hparams.max_abs_value) * -hparams.min_level_db / (2 * hparams.max_abs_value)) + hparams.min_level_db)
else:
return ((D * -hparams.min_level_db / hparams.max_abs_value) + hparams.min_level_db)

def _denormalize_tensorflow(D, hparams):
    if hparams.allow_clipping_in_normalization:
        if hparams.symmetric_mels:
            return (((tf.clip_by_value(D, -hparams.max_abs_value,
                hparams.max_abs_value) + hparams.max_abs_value) * -hparams.min_level_db / (2 * hparams.max_abs_value))
                + hparams.min_level_db)
        else:
            return ((tf.clip_by_value(D, 0, hparams.max_abs_value) * -hparams.min_level_db / hparams.max_abs_value) + hparams.min_level_db)

    if hparams.symmetric_mels:
        return (((D + hparams.max_abs_value) * -hparams.min_level_db / (2 * hparams.max_abs_value)) + hparams.min_level_db)
    else:
        return ((D * -hparams.min_level_db / hparams.max_abs_value) + hparams.min_level_db)
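
For context, a minimal usage sketch of the TensorFlow Griffin-Lim path added above (an illustration only; it assumes TensorFlow 1.x with tf.contrib.signal, as the code requires, and a time-major mel prediction of shape [frames, num_mels] saved under a placeholder file name):

```python
import numpy as np
import tensorflow as tf

from datasets import audio
from hparams import hparams

# Placeholder for a predicted mel spectrogram, time-major: [frames, num_mels].
mel_input = tf.placeholder(tf.float32, [None, hparams.num_mels])
wav_output = audio.inv_mel_spectrogram_tensorflow(mel_input, hparams)

with tf.Session() as sess:
    mel = np.load('mel-prediction.npy')  # placeholder path to a saved mel prediction
    wav = sess.run(wav_output, feed_dict={mel_input: mel})
    audio.save_wav(wav, 'tf_griffin_lim.wav', hparams.sample_rate)
```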
27 changes: 19 additions & 8 deletions datasets/preprocessor.py
@@ -69,13 +69,23 @@ def _process_utterance(mel_dir, linear_dir, wav_dir, index, wav_path, text, hpar
wav_path))
return None

#Trim lead/trail silences
if hparams.trim_silence:
wav = audio.trim_silence(wav, hparams)

#Pre-emphasize
preem_wav = audio.preemphasis(wav, hparams.preemphasis, hparams.preemphasize)

#rescale wav
if hparams.rescale:
wav = wav / np.abs(wav).max() * hparams.rescaling_max
preem_wav = preem_wav / np.abs(preem_wav).max() * hparams.rescaling_max

#M-AILABS extra silence specific
if hparams.trim_silence:
wav = audio.trim_silence(wav, hparams)
#Assert all audio is in [-1, 1]
if (wav > 1.).any() or (wav < -1.).any():
raise RuntimeError('wav has invalid value: {}'.format(wav_path))
if (preem_wav > 1.).any() or (preem_wav < -1.).any():
raise RuntimeError('wav has invalid value: {}'.format(wav_path))

#Mu-law quantize
if is_mulaw_quantize(hparams.input_type):
@@ -85,6 +95,7 @@ def _process_utterance(mel_dir, linear_dir, wav_dir, index, wav_path, text, hpar
#Trim silences
start, end = audio.start_and_end_indices(out, hparams.silence_threshold)
wav = wav[start: end]
preem_wav = preem_wav[start: end]
out = out[start: end]

constant_values = mulaw_quantize(0, hparams.quantize_channels)
@@ -103,14 +114,14 @@ def _process_utterance(mel_dir, linear_dir, wav_dir, index, wav_path, text, hpar
out_dtype = np.float32

# Compute the mel scale spectrogram from the wav
mel_spectrogram = audio.melspectrogram(wav, hparams).astype(np.float32)
mel_spectrogram = audio.melspectrogram(preem_wav, hparams).astype(np.float32)
mel_frames = mel_spectrogram.shape[1]

if mel_frames > hparams.max_mel_frames and hparams.clip_mels_length:
return None

#Compute the linear scale spectrogram from the wav
linear_spectrogram = audio.linearspectrogram(wav, hparams).astype(np.float32)
linear_spectrogram = audio.linearspectrogram(preem_wav, hparams).astype(np.float32)
linear_frames = linear_spectrogram.shape[1]

#sanity check
@@ -125,10 +136,10 @@ def _process_utterance(mel_dir, linear_dir, wav_dir, index, wav_path, text, hpar
out = np.pad(out, (l, r), mode='constant', constant_values=constant_values)
else:
#Ensure time resolution adjustment between audio and mel-spectrogram
pad = audio.librosa_pad_lr(wav, hparams.n_fft, audio.get_hop_size(hparams))
l_pad, r_pad = audio.librosa_pad_lr(wav, hparams.n_fft, audio.get_hop_size(hparams), hparams.wavenet_pad_sides)

#Reflect pad audio signal (Just like it's done in Librosa to avoid frame inconsistency)
out = np.pad(out, pad, mode='reflect')
#Pad audio signal on the right with constant values (matching the constant padding used in _stft to avoid frame inconsistency)
out = np.pad(out, (l_pad, r_pad), mode='constant', constant_values=constant_values)

assert len(out) >= mel_frames * audio.get_hop_size(hparams)

Expand Down
(Diffs for the remaining 29 changed files are not shown.)
