Official implementation of "Neural Text to Articulate Talk: Deep Text to Audiovisual Speech Synthesis achieving both Auditory and Photo-realism".
System requirements: Linux, a GPU, conda, and up to 10 GB of disk space (you may adjust the paths to write to an external hard drive).
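Before installing, a quick preflight check can confirm the requirements listed above. The sketch below is not part of the repository: the function name preflight is hypothetical, and it assumes an NVIDIA GPU whose driver puts nvidia-smi on the PATH.

```python
import platform
import shutil


def preflight(path=".", required_gb=10.0):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if platform.system() != "Linux":
        problems.append("operating system is not Linux")
    if shutil.which("conda") is None:
        problems.append("conda not found on PATH")
    # Assumption: an NVIDIA GPU, detected through the nvidia-smi tool.
    if shutil.which("nvidia-smi") is None:
        problems.append("nvidia-smi not found on PATH (no NVIDIA driver?)")
    # Free space on the filesystem that holds `path`, in GB.
    free_gb = shutil.disk_usage(path).free / 1024**3
    if free_gb < required_gb:
        problems.append(
            f"only {free_gb:.1f} GB free under {path!r}, need {required_gb:g} GB"
        )
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print("WARNING:", problem)
```

If you plan to write to an external drive, pass its mount point as path so that the free-space check looks at the right filesystem.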
Clone the repository:
git clone https://github.com/g-milis/NEUTART && cd NEUTART
Create a virtual environment and use it for all the commands in this guide, unless specified otherwise:
conda env create -f environment.yml && conda activate NEUTART
Before downloading the pretrained models, you need to create a FLAME account (the setup script will ask for your credentials). Then, download the required assets and pretrained models with:
./setup.sh
The script downloads only the missing files, so you can rerun it if necessary.
For licensing reasons, you have to complete the next two installation steps manually. Both projects provide detailed instructions.
The photorealistic preprocessing step employs the FSGAN face segmenter, which you can obtain by filling out a simple form. Please follow the instructions to download lfw_figaro_unet_256_2_0_segmentation_v1.pth (from the v1 folder) and place it under photorealistic/preprocessing/segmentation.
Finally, you need to create a texture model with BFM_to_FLAME. Please follow the instructions to create the model FLAME_albedo_from_BFM.npz and place it under spectre/data.
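Since these two files are placed by hand, it may help to verify them before moving on. This is a minimal sketch, assuming it is run from the repository root; the helper name missing_assets is hypothetical, while the two paths are the ones given in the steps above.

```python
from pathlib import Path

# Paths taken from the two manual steps above, relative to the repository root.
MANUAL_ASSETS = [
    "photorealistic/preprocessing/segmentation/lfw_figaro_unet_256_2_0_segmentation_v1.pth",
    "spectre/data/FLAME_albedo_from_BFM.npz",
]


def missing_assets(repo_root="."):
    """Return the manually installed files that are not present under repo_root."""
    root = Path(repo_root)
    return [rel for rel in MANUAL_ASSETS if not (root / rel).is_file()]


if __name__ == "__main__":
    for rel in missing_assets():
        print("missing:", rel)
```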
In the assets directory you will find reference videos for two TCD-TIMIT subjects. Note that the default one for inference is 21M. Process the reference video with:
./photorealistic/preprocess.sh test processed_videos/21M_test assets/21M
Now run:
./inference.sh
You may easily adjust the inference parameters inside the script. Also see misc/batch_inference.sh. If you want to run inference after training on your own subjects, make sure to reserve a few-second clip to use as reference.
We will finetune the pre-trained renderer in checkpoints/meta-renderer. You need 5-10 minutes of identically lit training clips of your subject. You can find out their total duration with ./misc/duration.sh. Suppose you have the clips in assets/09F_train. Process them with:
./photorealistic/preprocess.sh train <SUBJECT_DIR> assets/09F_train
where <SUBJECT_DIR> is the directory where the processing outputs will be saved, for instance processed_videos/09F_train. Please note that the photorealistic preprocessing has to precede the audiovisual preprocessing.
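To check the 5-10 minute requirement from Python, the clip durations can be summed with ffprobe. This sketch is an illustration and not the repository's misc/duration.sh: it assumes ffprobe (shipped with FFmpeg) is on the PATH and that the clips are .mp4 files directly inside the directory. The function names are hypothetical.

```python
import subprocess
from pathlib import Path


def clip_seconds(path):
    """Duration of one media file in seconds, read with ffprobe."""
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-show_entries", "format=duration",
         "-of", "default=noprint_wrappers=1:nokey=1", str(path)],
        capture_output=True, text=True, check=True,
    ).stdout
    return float(out.strip())


def total_minutes(clip_dir, probe=clip_seconds):
    """Sum the durations of all .mp4 files directly inside clip_dir, in minutes."""
    clips = sorted(Path(clip_dir).glob("*.mp4"))
    return sum(probe(clip) for clip in clips) / 60.0


def in_recommended_range(minutes, low=5.0, high=10.0):
    # The 5-10 minute range comes from the guide above.
    return low <= minutes <= high


if __name__ == "__main__":
    minutes = total_minutes("assets/09F_train")
    print(f"{minutes:.1f} min, within 5-10 min: {in_recommended_range(minutes)}")
```

The probe parameter exists only so the summation can be exercised without real media files.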
Train the renderer with ./photorealistic/train.sh <SUBJECT_NAME> <SUBJECT_DIR>, as in:
./photorealistic/train.sh 09F processed_videos/09F_train
You may adjust the training script according to the options in photorealistic/renderer/options.
You need to install the Montreal Forced Aligner (MFA):
conda install montreal-forced-aligner
If this causes any trouble, create a dedicated environment for mfa commands:
conda create -n mfa && conda activate mfa && conda config --add channels conda-forge && conda install montreal-forced-aligner
Once installed, download the acoustic model (from within the mfa environment, if you created one):
mfa model download acoustic english_us_arpa
Let's finetune the pre-trained multispeaker model in checkpoints/TCD-TIMIT on clips of a single speaker. Suppose you have a set of .mp4 clips along with their transcriptions. We will use TCD-TIMIT subject 09F as an example.
- Copy the contents of audiovisual/config/21M to audiovisual/config/09F and adjust preprocess.yaml according to the prompts. The scripts with a -d argument expect the directory name of your dataset's configs, under config (e.g. 21M and 09F).
- The MFA expects a file structure <DATASET>/<SPEAKER>/<UTTERANCE>.(wav|lab), in our case 09F/09F/*.(wav|lab) (in the single speaker case, the dataset and the speaker coincide). You can either construct it manually, or run audiovisual/prepare_align.py after modifying it accordingly:
python audiovisual/prepare_align.py -d 09F
- Perform the text-to-audio alignment. Use the <TTS_PATH> and <PROCESSED_PATH> that you specified in preprocess.yaml (e.g. tts_data/09F and processed_data/09F):
mfa align <TTS_PATH> audiovisual/text/librispeech-lexicon.txt english_us_arpa <PROCESSED_PATH>
- Preprocess the aligned data using the script below. Bear in mind that the preprocessor expects the videos in the format <UTTERANCE>.mp4. If you don't have this format, you can just adjust the video_path variable in Preprocessor.build_from_path():
python audiovisual/preprocess.py -d 09F
- Adjust audiovisual/config/09F/train.yaml according to the prompts and train the audiovisual module with:
python audiovisual/train.py -d 09F
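If you choose to construct the MFA file structure manually instead of running audiovisual/prepare_align.py, the layout from the second step can be built along these lines. This is a sketch under stated assumptions, not the repository's script: the function name build_mfa_corpus is hypothetical, the transcripts are assumed to be available as a dictionary from utterance name to text, the audio is assumed to exist as <UTTERANCE>.wav files, and corpus_dir is assumed to be the directory you later pass to mfa align as <TTS_PATH>.

```python
import shutil
from pathlib import Path


def build_mfa_corpus(wav_dir, transcripts, corpus_dir, speaker):
    """Lay out <corpus_dir>/<speaker>/<utterance>.(wav|lab) for MFA.

    transcripts maps an utterance name to its text (hypothetical input format).
    Returns the utterances that were skipped because their .wav was missing.
    """
    out_dir = Path(corpus_dir) / speaker
    out_dir.mkdir(parents=True, exist_ok=True)
    skipped = []
    for utt, text in sorted(transcripts.items()):
        wav = Path(wav_dir) / f"{utt}.wav"
        if not wav.is_file():
            skipped.append(utt)
            continue
        shutil.copy2(wav, out_dir / wav.name)
        # MFA reads the transcript from a .lab file next to the audio.
        (out_dir / f"{utt}.lab").write_text(text.strip() + "\n", encoding="utf-8")
    return skipped
```

For the running example, a call such as build_mfa_corpus("my_wavs", transcripts, "tts_data/09F", "09F") would produce tts_data/09F/09F/*.(wav|lab), where my_wavs is a placeholder for wherever your audio lives.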
For training on in-the-wild videos, make sure you select a subject with adequate training footage, neutral head pose, and consistent lighting throughout the video. We will demonstrate on a sample of the HDTF dataset.
- Download the video with:
yt-dlp -f "bestvideo[ext=mp4]+bestaudio[ext=m4a]/best[ext=mp4]/best" -o assets/HDTF_raw.mp4 https://www.youtube.com/watch?v=jSw_GUq6ato
- Change the frame rate to 25 fps (recommended) and extract the audio to a wav file:
ffmpeg -i assets/HDTF_raw.mp4 -r 25 -c:v libx264 assets/HDTF.mp4
ffmpeg -i assets/HDTF.mp4 assets/HDTF.wav
- Transcribe it using the system of your choice. You may find this Whisper API useful, since it also splits the speech into smaller utterances. Save the resulting JSON file and provide it as an input to the next step.
- Open misc/split_video.py and change the parameters at the top to match your configuration. Then run it to split your video into clips. From then on, you can move forward by following the generic steps above.
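To make the splitting step concrete, the sketch below turns a transcript into one ffmpeg command and one transcription per utterance. It does not reproduce misc/split_video.py. It assumes the JSON has a top-level "segments" list whose entries carry "start", "end" (in seconds) and "text", which is the layout produced by the open-source Whisper package; your transcription service may differ. The function name plan_clips, the output directory assets/HDTF_train and the minimum length are all hypothetical.

```python
import json
from pathlib import Path


def plan_clips(transcript_json, video="assets/HDTF.mp4",
               out_dir="assets/HDTF_train", min_seconds=1.0):
    """Return (ffmpeg_command, transcript_text) pairs, one per usable segment."""
    # Assumed schema: {"segments": [{"start": s, "end": e, "text": "..."}]}
    segments = json.loads(transcript_json)["segments"]
    plan = []
    for i, seg in enumerate(segments):
        start, end = float(seg["start"]), float(seg["end"])
        if end - start < min_seconds:
            continue  # skip very short segments
        clip = Path(out_dir) / f"{i:04d}.mp4"
        # -ss gives the start time and -t the clip length, both in seconds.
        cmd = ["ffmpeg", "-i", video, "-ss", f"{start:.2f}",
               "-t", f"{end - start:.2f}", "-c:v", "libx264", str(clip)]
        plan.append((cmd, seg["text"].strip()))
    return plan
```

Each command can then be executed with subprocess.run(cmd, check=True) once out_dir exists, and the returned text saved next to the clip as its transcription.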
- Please inspect the output of ./photorealistic/preprocess.sh, making sure the images look consistent.
- If your subject is not from TCD-TIMIT, you might have to adjust the parameters in audiovisual.utils.lipread.cut_mouth before preprocessing, in order to write the mouth clips. The videos need to be cropped from the bottom of the nose to the chin, with the mouth centered, like misc/mouth.mp4 (you may comment out the line Normalize(mean, std) for easier inspection, but please restore it afterwards).
- The default training configuration updates all the weights in the audiovisual module, but specifying transfer=True allows you to freeze the encoder weights if you want to train on limited data.
- Examine the training progress of the audiovisual module with tensorboard --logdir checkpoints/<DATASET>/audiovisual/log.
- If the output of the audiovisual module is jittery, please increase the weight_vector for the regularization loss exp_reg_loss in audiovisual.model.loss.AudiovisualLoss.forward.
- Examine the training progress of the photorealistic module by opening checkpoints/09F/photorealistic/web/index.html.
Deep learning systems for photorealistic talking head generation like NEUTART can have a very positive impact in many applications such as digital avatars, virtual assistants, accessibility tools, teleconferencing, video games, movie dubbing, and human-machine interfaces. However, this type of technology has the risk of being misused towards unethical or malicious purposes, since it can produce deepfake videos of individuals without their consent. We believe that researchers and engineers working in this field need to be mindful of these ethical issues and contribute to raising public awareness about the capabilities of such AI systems, as well as the development of state-of-the-art deepfake detectors. In our work, generated videos are always presented as synthetic, either explicitly or implicitly (when clearly implied by the context), and we encourage you to follow this practice. Please make sure that you understand the conditions under which this project is licensed, before using it.
Our code is based on these great projects:
Please do not hesitate to raise issues, create pull requests, or ask questions via email. I will reply swiftly.
If you find this work useful for your research, please cite our paper:
@misc{milis2023neural,
title={Neural Text to Articulate Talk: Deep Text to Audiovisual Speech Synthesis achieving both Auditory and Photo-realism},
author={Milis, Georgios and Filntisis, Panagiotis P. and Roussos, Anastasios and Maragos, Petros},
journal={arXiv preprint arXiv:2312.06613},
year={2023}
}