Commit
Merge remote-tracking branch 'upstream/master'
apls777 committed Apr 13, 2019
2 parents e52c53e + ab5cb08 commit d8f7c18
Showing 32 changed files with 2,644 additions and 1,034 deletions.
56 changes: 49 additions & 7 deletions README.md
@@ -1,6 +1,12 @@
# Tacotron-2:
Tensorflow implementation of DeepMind's Tacotron-2, a deep neural network architecture described in this paper: [Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions](https://arxiv.org/pdf/1712.05884.pdf)

This repository contains additional improvements and experiments beyond the paper. We therefore provide a **paper_hparams.py** file, which holds the exact hyperparameters needed to reproduce the paper's results without any extras.

The suggested **hparams.py** file, used by default, contains those hyperparameters plus extras that proved to give better results in most cases. Feel free to adjust the parameters as needed.

DIFFERENCES WILL BE HIGHLIGHTED IN DOCUMENTATION SHORTLY.


# Repository Structure:
Tacotron-2
@@ -20,14 +26,25 @@ Tensorflow implementation of DeepMind's Tacotron-2. A deep neural network archit
│  │  └── wavs
│   ├── mel-spectrograms
│   ├── plots
│   ├── pretrained
│   ├── taco_pretrained
│   ├── metas
│   └── wavs
├── logs-Wavenet (4)
│   ├── eval-dir
│   │  ├── plots
│  │  └── wavs
│   ├── plots
│   ├── pretrained
│   ├── wave_pretrained
│   ├── metas
│   └── wavs
├── logs-Tacotron-2 ( * )
│   ├── eval-dir
│   │  ├── plots
│  │  └── wavs
│   ├── plots
│   ├── taco_pretrained
│   ├── wave_pretrained
│   ├── metas
│   └── wavs
├── papers
├── tacotron
@@ -60,6 +77,8 @@ The previous tree shows the current state of the repository (separate training,
- Step **(4)**: Train your Wavenet model. Yields the **logs-Wavenet** folder.
- Step **(5)**: Synthesize audio using the Wavenet model. Gives the **wavenet_output** folder.

- Note: Steps 2, 3, and 4 can be done in a single run that trains both Tacotron and WaveNet (Tacotron-2, step ( * )); see the example below.
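
For reference, the joint run can be launched with a single training command (a sketch; it assumes the `--model` argument of train.py used elsewhere in this README, with 'Tacotron-2' selecting the joint pipeline):

> python train.py --model='Tacotron-2'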


Note:
- **Our preprocessing only supports Ljspeech and Ljspeech-like datasets (M-AILABS speech data)!** If running on datasets stored differently, you will probably need to make your own preprocessing script.
@@ -87,12 +106,33 @@ To have an overview of our advance on this project, please refer to [this discus
Since the two parts of the global model are trained separately, we can start by training the feature prediction model and use its predictions later during the WaveNet training; a possible command sequence is sketched below.
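
For the separate route, a plausible sequence looks like this sketch (it assumes the train.py and synthesize.py interfaces shown later in this README; adjust the flags to your setup):

> python train.py --model='Tacotron'

> python synthesize.py --model='Tacotron' --mode='synthesis' --GTA=True

> python train.py --model='WaveNet'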

# How to start
first, you need to have python 3 installed along with [Tensorflow](https://www.tensorflow.org/install/).
- **Machine Setup:**

First, you need to have python 3 installed along with [Tensorflow](https://www.tensorflow.org/install/).

next you can install the requirements. If you are an Anaconda user: (else replace **pip** with **pip3** and **python** with **python3**)
Next, you need to install some Linux dependencies to ensure audio libraries work properly:

> apt-get install -y libasound-dev portaudio19-dev libportaudio2 libportaudiocpp0 ffmpeg libav-tools
Finally, you can install the requirements. If you are an Anaconda user (otherwise replace **pip** with **pip3** and **python** with **python3**):

> pip install -r requirements.txt
- **Docker:**

Alternatively, you can build the **Docker image** to ensure everything is set up automatically and use the project inside Docker containers.
**The Dockerfile is inside the "docker" folder.**

The Docker image can be built with:

> docker build -t tacotron-2_image docker/
Containers can then be run with:

> docker run -i --name new_container tacotron-2_image
Please report any issues with Docker usage of our models; I'll get to it. Thanks!

# Dataset:
We tested the code above on the [LJSpeech dataset](https://keithito.com/LJ-Speech-Dataset/), which has almost 24 hours of labeled recordings of a single female speaker. (Further information on the dataset is available in the README file included with the download.)
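
For reference, a minimal download-and-extract sketch (the archive URL and file name are assumptions based on the dataset page linked above; adjust them as needed and place the extracted folder where your preprocessing expects it):

> wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2

> tar -xjf LJSpeech-1.1.tar.bz2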

@@ -105,6 +145,8 @@ Before proceeding, you must pick the hyperparameters that best suit your needs.

To pick optimal FFT parameters, I have made a **griffin_lim_synthesis_tool** notebook that you can use to invert real extracted mel/linear spectrograms and judge how good your preprocessing is. All other options are well explained in **hparams.py** and have meaningful names, so you can experiment with them freely.
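
As a quick illustration of what the notebook does, here is a minimal round-trip sketch using this repository's datasets/audio.py and hparams.py (the wav path is a placeholder, and audio.load_wav is assumed to take a path and a sample rate, as used by the preprocessor):

```python
from datasets import audio
from hparams import hparams

# Load any test utterance at the configured sample rate (placeholder path).
wav = audio.load_wav('LJ001-0001.wav', hparams.sample_rate)

# Extract a mel spectrogram with the current hparams, then invert it with Griffin-Lim.
mel = audio.melspectrogram(wav, hparams)               # shape: [num_mels, frames]
reconstructed = audio.inv_mel_spectrogram(mel, hparams)

# Listen to the result to judge the chosen fft/hop/window settings.
audio.save_wav(reconstructed, 'griffin_lim_check.wav', hparams.sample_rate)
```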

AWAIT DOCUMENTATION ON HPARAMS SHORTLY!!

# Preprocessing
Before running the following steps, please make sure you are inside the **Tacotron-2** folder.
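
Preprocessing can then be started with something like the following (an illustrative example only, assuming preprocess.py's defaults are set up for the LJSpeech layout):

> python preprocess.py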

@@ -158,16 +200,16 @@ For the spectrogram prediction network (separately), there are **three types** o

- **Evaluation** (synthesis on custom sentences). This is what we'll usually use after having a full end-to-end model.

> python synthesize.py --model='Tacotron' --mode='eval'
> python synthesize.py --model='Tacotron'
- **Natural synthesis** (let the model make predictions alone by feeding the last decoder output to the next time step).

> python synthesize.py --model='Tacotron' --GTA=False
> python synthesize.py --model='Tacotron' --mode='synthesis' --GTA=False

- **Ground Truth Aligned synthesis** (DEFAULT: the model is assisted by true labels in a teacher-forcing manner). This synthesis method is used when predicting the mel spectrograms used to train the WaveNet vocoder. (It yields better results, as stated in the paper.)

> python synthesize.py --model='Tacotron' --GTA=True
> python synthesize.py --model='Tacotron' --mode='synthesis' --GTA=True
Synthesizing the **waveforms** conditioned on previously synthesized mel spectrograms (separately) can be done with:
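
An illustrative invocation (a sketch, assuming the same synthesize.py --model interface used above):

> python synthesize.py --model='WaveNet'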

106 changes: 93 additions & 13 deletions datasets/audio.py
@@ -14,8 +14,10 @@ def save_wav(wav, path, sr):
#proposed by @dsmiller
wavfile.write(path, sr, wav.astype(np.int16))

def save_wavenet_wav(wav, path, sr):
librosa.output.write_wav(path, wav, sr=sr)
def save_wavenet_wav(wav, path, sr, inv_preemphasize, k):
# wav = inv_preemphasis(wav, k, inv_preemphasize)
wav *= 32767 / max(0.01, np.max(np.abs(wav)))
wavfile.write(path, sr, wav.astype(np.int16))

def preemphasis(wav, k, preemphasize=True):
if preemphasize:
@@ -57,16 +59,18 @@ def get_hop_size(hparams):
return hop_size

def linearspectrogram(wav, hparams):
D = _stft(preemphasis(wav, hparams.preemphasis, hparams.preemphasize), hparams)
S = _amp_to_db(np.abs(D), hparams) - hparams.ref_level_db
# D = _stft(preemphasis(wav, hparams.preemphasis, hparams.preemphasize), hparams)
D = _stft(wav, hparams)
S = _amp_to_db(np.abs(D)**hparams.magnitude_power, hparams) - hparams.ref_level_db

if hparams.signal_normalization:
return _normalize(S, hparams)
return S

def melspectrogram(wav, hparams):
D = _stft(preemphasis(wav, hparams.preemphasis, hparams.preemphasize), hparams)
S = _amp_to_db(_linear_to_mel(np.abs(D), hparams), hparams) - hparams.ref_level_db
# D = _stft(preemphasis(wav, hparams.preemphasis, hparams.preemphasize), hparams)
D = _stft(wav, hparams)
S = _amp_to_db(_linear_to_mel(np.abs(D)**hparams.magnitude_power, hparams), hparams) - hparams.ref_level_db

if hparams.signal_normalization:
return _normalize(S, hparams)
@@ -79,7 +83,7 @@ def inv_linear_spectrogram(linear_spectrogram, hparams):
else:
D = linear_spectrogram

S = _db_to_amp(D + hparams.ref_level_db) #Convert back to linear
S = _db_to_amp(D + hparams.ref_level_db)**(1/hparams.magnitude_power) #Convert back to linear

if hparams.use_lws:
processor = _lws_processor(hparams)
@@ -97,7 +101,7 @@ def inv_mel_spectrogram(mel_spectrogram, hparams):
else:
D = mel_spectrogram

S = _mel_to_linear(_db_to_amp(D + hparams.ref_level_db), hparams) # Convert back to linear
S = _mel_to_linear(_db_to_amp(D + hparams.ref_level_db)**(1/hparams.magnitude_power), hparams) # Convert back to linear

if hparams.use_lws:
processor = _lws_processor(hparams)
@@ -107,6 +111,39 @@ def inv_mel_spectrogram(mel_spectrogram, hparams):
else:
return inv_preemphasis(_griffin_lim(S ** hparams.power, hparams), hparams.preemphasis, hparams.preemphasize)

###########################################################################################
# tensorflow Griffin-Lim
# Thanks to @begeekmyfriend: https://github.com/begeekmyfriend/Tacotron-2/blob/mandarin-new/datasets/audio.py

def inv_linear_spectrogram_tensorflow(spectrogram, hparams):
    '''Builds computational graph to convert spectrogram to waveform using TensorFlow.
    Unlike inv_spectrogram, this does NOT invert the preemphasis. The caller should call
    inv_preemphasis on the output after running the graph.
    '''
    if hparams.signal_normalization:
        D = _denormalize_tensorflow(spectrogram, hparams)
    else:
        D = spectrogram

    S = tf.pow(_db_to_amp_tensorflow(D + hparams.ref_level_db), (1/hparams.magnitude_power))
    return _griffin_lim_tensorflow(tf.pow(S, hparams.power), hparams)

def inv_mel_spectrogram_tensorflow(mel_spectrogram, hparams):
    '''Builds computational graph to convert mel spectrogram to waveform using TensorFlow.
    Unlike inv_mel_spectrogram, this does NOT invert the preemphasis. The caller should call
    inv_preemphasis on the output after running the graph.
    '''
    if hparams.signal_normalization:
        D = _denormalize_tensorflow(mel_spectrogram, hparams)
    else:
        D = mel_spectrogram

    S = tf.pow(_db_to_amp_tensorflow(D + hparams.ref_level_db), (1/hparams.magnitude_power))
    S = _mel_to_linear_tensorflow(S, hparams)  # Convert back to linear
    return _griffin_lim_tensorflow(tf.pow(S, hparams.power), hparams)

###########################################################################################

def _lws_processor(hparams):
import lws
return lws.lws(hparams.n_fft, get_hop_size(hparams), fftsize=hparams.win_size, mode="speech")
@@ -123,11 +160,26 @@ def _griffin_lim(S, hparams):
y = _istft(S_complex * angles, hparams)
return y

def _griffin_lim_tensorflow(S, hparams):
    '''TensorFlow implementation of Griffin-Lim
    Based on https://github.com/Kyubyong/tensorflow-exercises/blob/master/Audio_Processing.ipynb
    '''
    with tf.variable_scope('griffinlim'):
        # TensorFlow's stft and istft operate on a batch of spectrograms; create batch of size 1
        S = tf.expand_dims(S, 0)
        S_complex = tf.identity(tf.cast(S, dtype=tf.complex64))
        y = tf.contrib.signal.inverse_stft(S_complex, hparams.win_size, get_hop_size(hparams), hparams.n_fft)
        for i in range(hparams.griffin_lim_iters):
            est = tf.contrib.signal.stft(y, hparams.win_size, get_hop_size(hparams), hparams.n_fft)
            angles = est / tf.cast(tf.maximum(1e-8, tf.abs(est)), tf.complex64)
            y = tf.contrib.signal.inverse_stft(S_complex * angles, hparams.win_size, get_hop_size(hparams), hparams.n_fft)
        return tf.squeeze(y, 0)

def _stft(y, hparams):
if hparams.use_lws:
return _lws_processor(hparams).stft(y).T
else:
return librosa.stft(y=y, n_fft=hparams.n_fft, hop_length=get_hop_size(hparams), win_length=hparams.win_size)
return librosa.stft(y=y, n_fft=hparams.n_fft, hop_length=get_hop_size(hparams), win_length=hparams.win_size, pad_mode='constant')

def _istft(y, hparams):
return librosa.istft(y, hop_length=get_hop_size(hparams), win_length=hparams.win_size)
@@ -155,11 +207,16 @@ def pad_lr(x, fsize, fshift):
return pad, pad + r
##########################################################
#Librosa correct padding
def librosa_pad_lr(x, fsize, fshift):
'''compute right padding (final frame)
def librosa_pad_lr(x, fsize, fshift, pad_sides=1):
'''compute right padding (final frame) or both sides padding (first and final frames)
'''
return int(fsize // 2)

assert pad_sides in (1, 2)
# return int(fsize // 2)
pad = (x.shape[0] // fshift + 1) * fshift - x.shape[0]
if pad_sides == 1:
return 0, pad
else:
return pad // 2, pad // 2 + pad % 2

# Conversions
_mel_basis = None
@@ -177,6 +234,12 @@ def _mel_to_linear(mel_spectrogram, hparams):
_inv_mel_basis = np.linalg.pinv(_build_mel_basis(hparams))
return np.maximum(1e-10, np.dot(_inv_mel_basis, mel_spectrogram))

def _mel_to_linear_tensorflow(mel_spectrogram, hparams):
    global _inv_mel_basis
    if _inv_mel_basis is None:
        _inv_mel_basis = np.linalg.pinv(_build_mel_basis(hparams))
    return tf.transpose(tf.maximum(1e-10, tf.matmul(tf.cast(_inv_mel_basis, tf.float32), tf.transpose(mel_spectrogram, [1, 0]))), [1, 0])

def _build_mel_basis(hparams):
assert hparams.fmax <= hparams.sample_rate // 2
return librosa.filters.mel(hparams.sample_rate, hparams.n_fft, n_mels=hparams.num_mels,
@@ -189,6 +252,9 @@ def _amp_to_db(x, hparams):
def _db_to_amp(x):
return np.power(10.0, (x) * 0.05)

def _db_to_amp_tensorflow(x):
    return tf.pow(tf.ones(tf.shape(x)) * 10.0, x * 0.05)

def _normalize(S, hparams):
if hparams.allow_clipping_in_normalization:
if hparams.symmetric_mels:
@@ -216,3 +282,17 @@ def _denormalize(D, hparams):
return (((D + hparams.max_abs_value) * -hparams.min_level_db / (2 * hparams.max_abs_value)) + hparams.min_level_db)
else:
return ((D * -hparams.min_level_db / hparams.max_abs_value) + hparams.min_level_db)

def _denormalize_tensorflow(D, hparams):
    if hparams.allow_clipping_in_normalization:
        if hparams.symmetric_mels:
            return (((tf.clip_by_value(D, -hparams.max_abs_value,
                hparams.max_abs_value) + hparams.max_abs_value) * -hparams.min_level_db / (2 * hparams.max_abs_value))
                + hparams.min_level_db)
        else:
            return ((tf.clip_by_value(D, 0, hparams.max_abs_value) * -hparams.min_level_db / hparams.max_abs_value) + hparams.min_level_db)

    if hparams.symmetric_mels:
        return (((D + hparams.max_abs_value) * -hparams.min_level_db / (2 * hparams.max_abs_value)) + hparams.min_level_db)
    else:
        return ((D * -hparams.min_level_db / hparams.max_abs_value) + hparams.min_level_db)
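
For context, a minimal usage sketch of the TensorFlow Griffin-Lim path added above (an illustration only; it assumes TensorFlow 1.x with tf.contrib.signal, as the code requires, and a time-major mel prediction of shape [frames, num_mels] saved under a placeholder file name):

```python
import numpy as np
import tensorflow as tf

from datasets import audio
from hparams import hparams

# Placeholder for a predicted mel spectrogram, time-major: [frames, num_mels].
mel_input = tf.placeholder(tf.float32, [None, hparams.num_mels])
wav_output = audio.inv_mel_spectrogram_tensorflow(mel_input, hparams)

with tf.Session() as sess:
    mel = np.load('mel-prediction.npy')  # placeholder path to a saved mel prediction
    wav = sess.run(wav_output, feed_dict={mel_input: mel})
    audio.save_wav(wav, 'tf_griffin_lim.wav', hparams.sample_rate)
```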
27 changes: 19 additions & 8 deletions datasets/preprocessor.py
@@ -69,13 +69,23 @@ def _process_utterance(mel_dir, linear_dir, wav_dir, index, wav_path, text, hpar
wav_path))
return None

#Trim lead/trail silences
if hparams.trim_silence:
wav = audio.trim_silence(wav, hparams)

#Pre-emphasize
preem_wav = audio.preemphasis(wav, hparams.preemphasis, hparams.preemphasize)

#rescale wav
if hparams.rescale:
wav = wav / np.abs(wav).max() * hparams.rescaling_max
preem_wav = preem_wav / np.abs(preem_wav).max() * hparams.rescaling_max

#M-AILABS extra silence specific
if hparams.trim_silence:
wav = audio.trim_silence(wav, hparams)
#Assert all audio is in [-1, 1]
if (wav > 1.).any() or (wav < -1.).any():
raise RuntimeError('wav has invalid value: {}'.format(wav_path))
if (preem_wav > 1.).any() or (preem_wav < -1.).any():
raise RuntimeError('wav has invalid value: {}'.format(wav_path))

#Mu-law quantize
if is_mulaw_quantize(hparams.input_type):
@@ -85,6 +95,7 @@ def _process_utterance(mel_dir, linear_dir, wav_dir, index, wav_path, text, hpar
#Trim silences
start, end = audio.start_and_end_indices(out, hparams.silence_threshold)
wav = wav[start: end]
preem_wav = preem_wav[start: end]
out = out[start: end]

constant_values = mulaw_quantize(0, hparams.quantize_channels)
@@ -103,14 +114,14 @@ def _process_utterance(mel_dir, linear_dir, wav_dir, index, wav_path, text, hpar
out_dtype = np.float32

# Compute the mel scale spectrogram from the wav
mel_spectrogram = audio.melspectrogram(wav, hparams).astype(np.float32)
mel_spectrogram = audio.melspectrogram(preem_wav, hparams).astype(np.float32)
mel_frames = mel_spectrogram.shape[1]

if mel_frames > hparams.max_mel_frames and hparams.clip_mels_length:
return None

#Compute the linear scale spectrogram from the wav
linear_spectrogram = audio.linearspectrogram(wav, hparams).astype(np.float32)
linear_spectrogram = audio.linearspectrogram(preem_wav, hparams).astype(np.float32)
linear_frames = linear_spectrogram.shape[1]

#sanity check
@@ -125,10 +136,10 @@ def _process_utterance(mel_dir, linear_dir, wav_dir, index, wav_path, text, hpar
out = np.pad(out, (l, r), mode='constant', constant_values=constant_values)
else:
#Ensure time resolution adjustment between audio and mel-spectrogram
pad = audio.librosa_pad_lr(wav, hparams.n_fft, audio.get_hop_size(hparams))
l_pad, r_pad = audio.librosa_pad_lr(wav, hparams.n_fft, audio.get_hop_size(hparams), hparams.wavenet_pad_sides)

#Reflect pad audio signal (Just like it's done in Librosa to avoid frame inconsistency)
out = np.pad(out, pad, mode='reflect')
#Pad audio signal on the right with constant values (matching the constant padding used in _stft to avoid frame inconsistency)
out = np.pad(out, (l_pad, r_pad), mode='constant', constant_values=constant_values)

assert len(out) >= mel_frames * audio.get_hop_size(hparams)

Expand Down
(Diffs for the remaining 29 changed files are not shown.)
