Tiro TTS is a text-to-speech API server which works with various TTS backends.
The service can accept either unnormalized text or an SSML document and respond with audio (MP3, Ogg Vorbis or raw 16-bit PCM) or with speech marks indicating the byte and time offset of each synthesized word in the request.
The full API documentation in OpenAPI 2 format is available online at tts.tiro.is. The documentation is auto-generated from src/schema.py.
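A minimal synthesis request could look like the sketch below; the endpoint path, parameter names and voice identifier are assumptions and should be checked against the OpenAPI documentation at tts.tiro.is.

```python
import requests

# Hypothetical request against the public instance; verify the endpoint
# and field names against the OpenAPI docs at tts.tiro.is before use.
response = requests.post(
    "https://tts.tiro.is/v0/speech",
    json={
        "OutputFormat": "mp3",   # audio format; speech marks use a different output format
        "Text": "Halló heimur",
        "VoiceId": "Alfur",      # assumed identifier for the Álfur voice
    },
)
response.raise_for_status()

with open("hello.mp3", "wb") as f:
    f.write(response.content)
```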
The models used are configured with a SynthesisSet protobuf message in text format, supplied via the environment variable TIRO_TTS_SYNTHESIS_SET_PB. See conf/synthesis_set.local.pbtxt for an example.
The following voices are currently accessible at tts.tiro.is:
- Diljá: Female voice developed by Reykjavík University (FastSpeech2 + MelGAN).
- Diljá v2: Female voice developed by Reykjavík University (ESPnet2 FastSpeech2 + Multiband MelGAN).
- Álfur: Male voice developed by Reykjavík University (FastSpeech2 + MelGAN).
- Álfur v2: Male voice developed by Reykjavík University (ESPnet2 FastSpeech2 + Multiband MelGAN).
- Bjartur: Male voice developed by Reykjavík University (ESPnet2 FastSpeech2 + Multiband MelGAN).
- Rósa: Female voice developed by Reykjavík University (ESPnet2 FastSpeech2 + Multiband MelGAN).
- Karl: Male voice on Amazon Polly.
- Dóra: Female voice on Amazon Polly.
The supported voice backends are described in voice.proto. There are three different backends: Fastspeech2MelganBackend, Espnet2Backend, and an AWS Polly proxy backend, PollyBackend.
The tiro.tts.Fastspeech2MelganBackend backend uses models created with cadia-lvl/FastSpeech2 and a vocoder created with seungwonpark/melgan. Both the FastSpeech2 and MelGAN models have to be converted to TorchScript before use. The converted models can also be downloaded:
- Álfur Fastspeech2 acoustic model optimized for x86 CPU inference
- Diljá Fastspeech2 acoustic model optimized for x86 CPU inference
- Álfur MelGAN vocoder
- Diljá MelGAN vocoder
To convert the vocoder to TorchScript you need access to the trained model and the audio files used to train it. Two scripts are needed for the conversion: //:melgan_preprocess and //:melgan_convert.
For the Diljá voice models from Reykjavík University (yet to be published), the steps to prepare the TorchScript MelGAN vocoder are:
Download the recordings:
mkdir wav
wget https://repository.clarin.is/repository/xmlui/bitstream/handle/20.500.12537/104/dilja.zip
unzip dilja.zip -d wav
Generate the input features:
bazel run :melgan_preprocess -- -c $PWD/src/lib/fastspeech/melgan/config/default.yaml -d $PWD/wav/c
Convert the vocoder model:
bazel run :melgan_convert -- -p $PATH_TO_ORIGINAL_MODEL -o $PWD/melgan_jit.pt -i $PWD/wav/c/audio
And then set melgan_uri in conf/synthesis_set.local.pbtxt to the path to melgan_jit.pt.
The FastSpeech2 acoustic model is converted to TorchScript using scripting, so no recordings are necessary. The script //:fastspeech_convert can be used to convert the model:
bazel run :fastspeech_convert -- -p $PATH_TO_ORIGINAL_MODEL -o $PWD/fastspeech_jit.pt
And then set fastspeech2_uri in conf/synthesis_set.local.pbtxt to the path to fastspeech_jit.pt.
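A quick way to verify the converted (or downloaded) TorchScript files before wiring them into the configuration is to load them with torch.jit.load. This is a minimal sketch: the file names match the commands above, and running actual inference would additionally require correctly shaped input tensors, which are not shown here.

```python
import torch

# Sanity check: the converted/downloaded TorchScript models should load
# without error. File names correspond to the conversion commands above.
fastspeech = torch.jit.load("fastspeech_jit.pt", map_location="cpu").eval()
melgan = torch.jit.load("melgan_jit.pt", map_location="cpu").eval()
print(type(fastspeech).__name__, type(melgan).__name__)
```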
There are two types of normalization referenced in voice.proto: BasicNormalizer and GrammatekNormalizer. BasicNormalizer is local and only handles stripping punctuation, while GrammatekNormalizer uses a gRPC service that implements com.grammatek.tts_frontend.TTSFrontend, such as grammatek/tts-frontend-service.
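BasicNormalizer's behaviour can be thought of as a simple punctuation-stripping pass over the input. The snippet below is only an illustration of that idea, not the actual implementation in this repository.

```python
import re

def strip_punctuation(text: str) -> str:
    # Replace anything that is neither a word character nor whitespace,
    # then collapse runs of whitespace. \w is Unicode-aware in Python,
    # so Icelandic letters such as á, ð and þ are preserved.
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

print(strip_punctuation("Halló, heimur!"))  # -> Halló heimur
```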
The voices are configured using a Protobuf text file specified by voice.proto. By default it is loaded from conf/synthesis_set.pbtxt, but this can be changed by setting the environment variable TIRO_TTS_SYNTHESIS_SET_PB. See src/config.py for a complete list of possible environment variables.
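As a rough sketch of how such a configuration can be read with the protobuf text format API (the module name voice_pb2 and the assumption that TIRO_TTS_SYNTHESIS_SET_PB holds a file path are mine; the server's actual loading logic lives in src/config.py):

```python
import os

from google.protobuf import text_format

# Assumption: the code generated from voice.proto is importable as
# voice_pb2 and defines the SynthesisSet message used for configuration.
import voice_pb2

# Assumption: TIRO_TTS_SYNTHESIS_SET_PB contains a path to a pbtxt file.
path = os.environ.get("TIRO_TTS_SYNTHESIS_SET_PB", "conf/synthesis_set.pbtxt")
with open(path, encoding="utf-8") as f:
    synthesis_set = text_format.Parse(f.read(), voice_pb2.SynthesisSet())

print(synthesis_set)
```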
The project requires Python 3.8 and uses Bazel for building. To build and run a local development server, use the script ./run.sh.
Docker can also be used to build the project:
docker build -t tiro-tts .
and then to run the server:
docker run -v DIR_WITH_MODELS:/models -v PATH_TO_SYNTHESIS_SET:/app/conf/synthesis_set.pbtxt \
-p 8000:8000 tiro-tts
Tiro TTS is licensed under the Apache License, Version 2.0. See LICENSE for more details. Some individual files may be licensed under different licenses, according to their headers.
This project was funded by the Language Technology Programme for Icelandic 2019-2023. The programme, which is managed and coordinated by Almannarómur, is funded by the Icelandic Ministry of Education, Science and Culture.