CelebV-Text: A Large-Scale Facial Text-Video Dataset
Jianhui Yu*,
Hao Zhu*,
Liming Jiang,
Chen Change Loy,
Weidong Cai,
and Wayne Wu
(*Equal contribution)
Demo Video | Project Page | Paper (arXiv)
Currently, text-driven generation models are booming in video editing with their compelling results. However, for face-centric text-to-video generation, challenges remain severe because a suitable dataset with high-quality videos and highly relevant texts is lacking. In this work, we present a large-scale, high-quality, and diverse facial text-video dataset, CelebV-Text, to facilitate research on facial text-to-video generation tasks. CelebV-Text contains 70,000 in-the-wild face video clips covering diverse visual content. Each video clip is paired with 20 texts generated by the proposed semi-automatic text generation strategy, which describes both static and dynamic attributes precisely. We perform a comprehensive statistical analysis of the videos, texts, and text-video relevance of CelebV-Text, verifying its superiority over other datasets. We also conduct extensive self-evaluations to show the effectiveness and potential of CelebV-Text. Furthermore, a benchmark is constructed with representative methods to standardize the evaluation of the facial text-to-video generation task.
- [11/08/2023]
- Audio files (67k) can now be downloaded (see issue).
- [20/06/2023]
- Videos can now be downloaded (see issue).
- [28/03/2023]
- Paper is now released here!
- [01/01/2023]
- [28/12/2022]
- The codebase and project page are created.
- The download and processing tools for the dataset are released. Use them to construct your CelebV-Text!
- [04/01/2024]
- Common points of confusion about the annotation files are explained here.
- Video download and processing tools.
- Text descriptions.
- Data annotations.
- Code of MMVID-interp.
- Automatic text generation tool and templates.
- Pretrained models of benchmarks.
The distributions of each attribute. CelebV-Text contains 70,000 video clips with a total duration of around 279 hours. Each video is accompanied by 20 sentences describing 6 designed attributes, including 40 general appearances, 5 detailed appearances, 6 light conditions, 37 actions, 8 emotions, and 6 light directions.
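As a quick sanity check on the figures above, the short Python sketch below derives the average clip length, the total number of text descriptions, and the total number of attribute labels. The variable names are illustrative only, and because the total duration is quoted as "around 279 hours", the average clip length is approximate.

```python
# Figures quoted in the paragraph above.
num_clips = 70_000
total_hours = 279        # "around 279 hours", so derived values are approximate
texts_per_clip = 20

# Number of labels in each of the 6 designed attribute groups.
label_counts = {
    "general appearance": 40,
    "detailed appearance": 5,
    "light condition": 6,
    "action": 37,
    "emotion": 8,
    "light direction": 6,
}

avg_clip_seconds = total_hours * 3600 / num_clips
total_texts = num_clips * texts_per_clip
total_labels = sum(label_counts.values())

print(f"average clip length: {avg_clip_seconds:.1f} s")
print(f"total text descriptions: {total_texts:,}")
print(f"attribute groups: {len(label_counts)}, labels overall: {total_labels}")
```

Under these numbers, a clip lasts roughly 14.3 seconds on average, the dataset holds 1,400,000 text descriptions, and the six attribute groups cover 102 labels in total.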
This is a toy example of applying a text-to-face model together with ChatGPT. In this demo, we use MMVID, simply trained on the proposed CelebV-Text dataset, to demonstrate CelebV-Text's potential in enabling visual GPT applications. In the future, more sophisticated methods will likely lead to better results.
- The CelebV-Text dataset is available for non-commercial research purposes only.
- All videos of the CelebV-Text dataset are obtained from the Internet and are not the property of our institutions. Our institutions are not responsible for the content or the meaning of these videos.
- You agree not to reproduce, duplicate, copy, sell, trade, resell or exploit for any commercial purposes, any portion of the videos and any portion of derived data.
- You agree not to further copy, publish, or distribute any portion of the CelebV-Text dataset, except that copies of the dataset may be made for internal use at a single site within the same organization.
Description | Link |
---|---|
general & detailed face attributes | Google Drive |
emotion | Google Drive |
action | Google Drive |
light direction | Google Drive |
light intensity | Google Drive |
light color temperature | Google Drive |
*metadata annotation | Google Drive |
Prepare the environment & Run script:
# prepare the environment
pip install youtube_dl
pip install opencv-python
# you can change the download folder in the code
python download_and_process.py
{
"clips":
{
"0-5BrmyFsYM_0": // clip 1
{
"ytb_id": "0-5BrmyFsYM", // youtube id
"duration": {"start_sec": 0.0, "end_sec": 9.64}, // start and end times in the original video
"bbox": {"top": 0, "bottom": 937, "left": 849, "right": 1872}, // bounding box
"version": "v0.1"
},
"00-30GQl0TM_7": // clip 2
{
"ytb_id": "00-30GQl0TM", // youtube id
"duration": {"start_sec": 415.29, "end_sec": 420.88}, // start and end times in the original video
"bbox": {"top": 0, "bottom": 1183, "left": 665, "right": 1956}, // bounding box
"version": "v0.1"
},
"..."
"..."
}
}
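To illustrate how the structure above might be consumed, here is a minimal Python sketch that walks the `clips` mapping and derives a source URL, the clip duration, and the crop size for each entry. It is not part of the released tools: the names `Clip` and `iter_clips` are hypothetical, the bounding box is assumed to be in pixel coordinates of the original video, and the URL uses the standard YouTube watch pattern. The sample is the first clip shown above with the `//` comments removed, since standard JSON parsers reject comments.

```python
import json
from typing import Dict, Iterator, NamedTuple

# First clip from the structure above, written as plain JSON (no comments).
SAMPLE = """
{
  "clips": {
    "0-5BrmyFsYM_0": {
      "ytb_id": "0-5BrmyFsYM",
      "duration": {"start_sec": 0.0, "end_sec": 9.64},
      "bbox": {"top": 0, "bottom": 937, "left": 849, "right": 1872},
      "version": "v0.1"
    }
  }
}
"""


class Clip(NamedTuple):
    """Hypothetical container for one annotated clip."""
    clip_id: str
    url: str
    start_sec: float
    end_sec: float
    width: int    # crop width, assuming bbox values are pixels
    height: int   # crop height, assuming bbox values are pixels


def iter_clips(metadata: Dict) -> Iterator[Clip]:
    """Yield one Clip per entry of the 'clips' mapping."""
    for clip_id, info in metadata["clips"].items():
        bbox = info["bbox"]
        duration = info["duration"]
        yield Clip(
            clip_id=clip_id,
            url="https://www.youtube.com/watch?v=" + info["ytb_id"],
            start_sec=duration["start_sec"],
            end_sec=duration["end_sec"],
            width=bbox["right"] - bbox["left"],
            height=bbox["bottom"] - bbox["top"],
        )


metadata = json.loads(SAMPLE)
clips = list(iter_clips(metadata))
for clip in clips:
    length = clip.end_sec - clip.start_sec
    print(f"{clip.clip_id}: {length:.2f} s, crop {clip.width}x{clip.height}, {clip.url}")
```

For the real annotation file, `json.loads(SAMPLE)` would be replaced by `json.load` on the downloaded file, whose path depends on where you saved it.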
In our paper, we trained the baselines using their original implementations:
Text Descriptions (MMVID) | Link |
---|---|
VQGAN | Google Drive |
general & detailed face attributes | Google Drive |
emotion | Google Drive |
action | Google Drive |
light direction | Google Drive |
light intensity & color temperature | Google Drive |
general face attributes + emotion + action + light direction | Google Drive |
Several of our previous publications might be of interest to you.
- Face Generation:
  - (ECCV 2022) CelebV-HQ: A Large-scale Video Facial Attributes Dataset. Zhu et al. [Paper], [Project Page], [Dataset]
  - (CVPR 2022) TransEditor: Transformer-Based Dual-Space GAN for Highly Controllable Facial Editing. Xu et al. [Paper], [Project Page], [Code]
- Human Generation:
  - (Tech. Report 2022) 3DHumanGAN: Towards Photo-realistic 3D-Aware Human Image Generation. Yang et al. [Paper], [Project Page], [Code]
  - (ECCV 2022) StyleGAN-Human: A Data-Centric Odyssey of Human. Fu et al. [Paper], [Project Page], [Dataset]
  - (SIGGRAPH 2022) Text2Human: Text-Driven Controllable Human Image Generation. Jiang et al. [Paper], [Project Page], [Code]
If you find this work useful for your research, please consider citing our paper:
@inproceedings{yu2022celebvtext,
title={{CelebV-Text}: A Large-Scale Facial Text-Video Dataset},
author={Yu, Jianhui and Zhu, Hao and Jiang, Liming and Loy, Chen Change and Cai, Weidong and Wu, Wayne},
booktitle={CVPR},
year={2023}
}
CelebV-Text is affiliated with OpenXDLab -- an open platform for X-Dimension high-quality data. This work is supported by NTU NAP, MOE AcRF Tier 1 (2021-T1-001-088).