2017-12-22-124.md

---
layout: post
title: "Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions"
byline: Shen et al
arxiv: 1712.05884
tags: [neural-network, deep-learning, text-to-speech, wavenet, TTS, generative, mel-scale, LSTM, CNN]
summary: "Tacotron 2 reproduces human speech at a creepy-good level by converting text first to a spectrogram and then to a waveform, using two predictive generative neural networks."
---

If you were developing a revolutionary way of synthesizing human-quality voice from text, why wouldn't you call it Tacotron 2? This paper shows off Google's new work on a novel text-to-speech (TTS) engine with just that name.

Tacotron 2 first maps freeform text to mel-scale spectrograms, and then a second, WaveNet-based network acts as a vocoder, converting the spectrograms to waveform audio. WaveNet is an existing TTS engine that generates waveform audio directly; its main shortcoming is that it requires significant domain-specific annotation of the input text (such as phoneme or other linguistic markers) before it can generate audio. Tacotron 2's predecessor, Tacotron, used a simpler waveform-generation step (the Griffin-Lim algorithm) after converting text to a spectrogram. This sequel uses a WaveNet synthesizer instead. The mel scale is a perceptual scale of pitches that listeners judge to be equally spaced from their neighbors. There is no one true mel scale or formula to derive one, but there are a handful of commonly used standards.
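To make the mel scale concrete, here is a minimal sketch of one of those commonly used conventions, the HTK-style formula `m = 2595 * log10(1 + f / 700)`. The function names are my own, and nothing here claims to be the exact variant the authors used; the 125 Hz to 7.6 kHz range and the count of 80 are borrowed from the paper's filterbank description purely as example inputs.

```python
import math


def hz_to_mel(f_hz: float) -> float:
    """Convert a frequency in Hz to mels (HTK-style convention, one of several)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m: float) -> float:
    """Inverse of hz_to_mel under the same convention."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_spaced_frequencies(f_min: float, f_max: float, n_points: int) -> list[float]:
    """Return n_points frequencies (Hz) that are evenly spaced on the mel scale."""
    m_lo, m_hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_hi - m_lo) / (n_points - 1)
    return [mel_to_hz(m_lo + i * step) for i in range(n_points)]


# Example inputs only: 80 points between 125 Hz and 7.6 kHz.
freqs = mel_spaced_frequencies(125.0, 7600.0, 80)
print([round(f) for f in freqs[:3]], "...", [round(f) for f in freqs[-3:]])
```

The points are equally spaced in mels but bunch together at low frequencies and spread out at high ones in Hz, which mirrors how the ear resolves pitch.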

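The spectrograms themselves come from a short-time Fourier transform (STFT), which slices audio into short, overlapping windows of time and measures the frequency content of each; the magnitudes are then mapped onto the mel scale. The spectrogram-prediction network learns to produce these frames from text: convolutional and LSTM layers encode the input characters, and an attention-based LSTM decoder emits the output spectrogram one frame at a time. The WaveNet vocoder is a slight modification of the existing WaveNet architecture that is conditioned on spectrograms, instead of linguistic features derived from text, when generating output waveforms.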
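The short-time Fourier transform mentioned above can be sketched in a few lines. This is a toy illustration and not the authors' code: it uses a naive DFT from the standard library for readability (real pipelines use an FFT library), and the 4 kHz sample rate and 500 Hz test tone are assumptions chosen to keep it fast. The 50 ms frame, 12.5 ms hop, and Hann window follow the settings described in the paper.

```python
import cmath
import math


def hann(n: int) -> list[float]:
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]


def stft_magnitude(signal: list[float], frame_len: int, hop_len: int) -> list[list[float]]:
    """Magnitude STFT: one list of frame_len // 2 + 1 bin magnitudes per frame."""
    window = hann(frame_len)
    n_bins = frame_len // 2 + 1
    # Precompute the complex exponentials used by the naive DFT.
    twiddle = [cmath.exp(-2j * math.pi * k / frame_len) for k in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop_len):
        chunk = [signal[start + i] * window[i] for i in range(frame_len)]
        spectrum = [
            abs(sum(chunk[n] * twiddle[(k * n) % frame_len] for n in range(frame_len)))
            for k in range(n_bins)
        ]
        frames.append(spectrum)
    return frames


SAMPLE_RATE = 4000                         # assumed toy rate, not the paper's
FRAME_LEN = SAMPLE_RATE * 50 // 1000       # 50 ms frame
HOP_LEN = SAMPLE_RATE * 125 // 10000       # 12.5 ms hop

# 0.2 s test tone at 500 Hz (hypothetical input).
tone = [math.sin(2.0 * math.pi * 500.0 * t / SAMPLE_RATE) for t in range(800)]
spec = stft_magnitude(tone, FRAME_LEN, HOP_LEN)

peak_bin = max(range(len(spec[0])), key=spec[0].__getitem__)
peak_hz = peak_bin * SAMPLE_RATE / FRAME_LEN
print(len(spec), "frames; strongest frequency in first frame:", peak_hz, "Hz")
```

Each row of `spec` is one time slice. Pooling its bins with a mel filterbank and taking a log would give a frame of the kind of spectrogram the network is trained to predict.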

Human evaluators slightly preferred human speech to the generated audio, but... only very slightly: the paper reports a mean opinion score (MOS) of 4.53 for Tacotron 2 against 4.58 for professionally recorded speech. The audio is surprisingly natural-sounding, and I had a hard time deciding which clips were human utterances and which were generated. You can try it yourself and see if you do any better at their website.

One thing that stuck out to me was the prosody and emotional tone of the voice: there is a level of "fakeness" that was presciently and expertly captured by Scarlett Johansson's Samantha in Spike Jonze's Her. If you haven't seen that film, I highly recommend watching it now so you know what to expect in our near future.