CelebV-Text: A Large-Scale Facial Text-Video Dataset (CVPR 2023)

CelebV-Text: A Large-Scale Facial Text-Video Dataset
Jianhui Yu*, Hao Zhu*, Liming Jiang, Chen Change Loy, Weidong Cai, and Wayne Wu
(*Equal contribution)
Demo Video | Project Page | Paper (arxiv)

Currently, text-driven generation models are booming in video editing with their compelling results. However, for face-centric text-to-video generation, challenges remain severe because a suitable dataset with high-quality videos and highly relevant texts is lacking. In this work, we present a large-scale, high-quality, and diverse facial text-video dataset, CelebV-Text, to facilitate research on facial text-to-video generation tasks. CelebV-Text contains 70,000 in-the-wild face video clips covering diverse visual content. Each video clip is paired with 20 texts generated by the proposed semi-automatic text generation strategy, which can describe both static and dynamic attributes precisely. We conduct a comprehensive statistical analysis of the videos, texts, and text-video relevance of CelebV-Text, verifying its superiority over other datasets. We also conduct extensive self-evaluations to show the effectiveness and potential of CelebV-Text. Furthermore, a benchmark is constructed with representative methods to standardize the evaluation of the facial text-to-video generation task.

Updates

  • [04/01/2024]
    • Confusion about the annotation files is explained here.
  • [11/08/2023]
    • Audio files (67k) can now be downloaded; see issue.
  • [20/06/2023]
    • Videos can now be downloaded; see issue.
  • [28/03/2023]
    • Paper is now released here!
  • [01/01/2023]
    • Code of MMVID-interp is now released here.
    • Pretrained models of benchmarks are released here.
    • Data annotation file is now released here.
  • [28/12/2022]
    • The codebase and project page are created.
    • The download and processing tools for the dataset are released. Use them to construct your CelebV-Text!

TODO

  • Video download and processing tools.
  • Text descriptions.
  • Data annotations.
  • Code of MMVID-interp.
  • Automatic text generation tool and templates.
  • Pretrained models of benchmarks.

Dataset Statistics

aa.mp4

The distributions of each attribute. CelebV-Text contains 70,000 video clips with a total duration of around 279 hours. Each video is accompanied by 20 sentences describing 6 designed attributes, including 40 general appearances, 5 detailed appearances, 6 light conditions, 37 actions, 8 emotions, and 6 light directions.

[Figures: video statistics; text statistics; text-video relevance]
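
As a quick sanity check on the figures quoted above, the following minimal sketch derives the average clip length and the total number of text descriptions. It assumes the rounded values from this section (70,000 clips, about 279 hours, 20 texts per clip), so the results are approximate; the variable names are illustrative only.

```python
# Rounded figures taken from the dataset statistics paragraph.
NUM_CLIPS = 70_000
TOTAL_HOURS = 279        # "around 279 hours", so treat as approximate
TEXTS_PER_CLIP = 20

# Average clip length in seconds and total number of text descriptions.
avg_clip_seconds = TOTAL_HOURS * 3600 / NUM_CLIPS
total_texts = NUM_CLIPS * TEXTS_PER_CLIP

print(f"average clip length: about {avg_clip_seconds:.1f} s")
print(f"total text descriptions: {total_texts:,}")
```

Under these assumptions, a typical clip lasts roughly 14 seconds and the dataset holds about 1.4 million sentences.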

Visual ChatGPT Demo

This is a toy example of applying a text-to-face model together with ChatGPT. In this demo, we use MMVID trained only on the proposed CelebV-Text dataset to demonstrate CelebV-Text's potential for enabling visual GPT applications. In the future, more sophisticated methods will likely lead to better results.

complex_input_demo.mp4

Agreement

  • The CelebV-Text dataset is available for non-commercial research purposes only.
  • All videos in the CelebV-Text dataset were obtained from the Internet and are not the property of our institutions. Our institutions are not responsible for the content or the meaning of these videos.
  • You agree not to reproduce, duplicate, copy, sell, trade, resell or exploit for any commercial purposes, any portion of the videos and any portion of derived data.
  • You agree not to further copy, publish, or distribute any portion of the CelebV-Text dataset, except that copies of the dataset may be made for internal use at a single site within the same organization.

Dataset Download

(1) Text Descriptions & Metadata Annotation

| Description | Link |
| --- | --- |
| general & detailed face attributes | Google Drive |
| emotion | Google Drive |
| action | Google Drive |
| light direction | Google Drive |
| light intensity | Google Drive |
| light color temperature | Google Drive |
| *metadata annotation | Google Drive |

(2) Video Download Pipeline

Prepare the environment & Run script:

# prepare the environment
pip install youtube_dl
pip install opencv-python

# you can change the download folder in the code 
python download_and_process.py

JSON File Structure:

{
    "clips":
    {
        "0-5BrmyFsYM_0":  // clip 1 
        {
            "ytb_id": "0-5BrmyFsYM",                                        // youtube id
            "duration": {"start_sec": 0.0, "end_sec": 9.64},                // start and end times in the original video
            "bbox": {"top": 0, "bottom": 937, "left": 849, "right": 1872},  // bounding box
            "version": "v0.1"
        },
      
        "00-30GQl0TM_7":  // clip 2 
        {
            "ytb_id": "00-30GQl0TM",                                        // youtube id
            "duration": {"start_sec": 415.29, "end_sec": 420.88},           // start and end times in the original video
            "bbox": {"top": 0, "bottom": 1183, "left": 665, "right": 1956}, // bounding box
            "version": "v0.1"
        },
        "..."
        "..."

    }
}
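
To illustrate how a clip entry might be consumed, here is a minimal sketch that parses one entry with the structure shown above and derives the clip length and crop geometry. It is not the logic of `download_and_process.py`. The helper `summarize_clip` is hypothetical, the `//` comments are removed so the excerpt is valid JSON, and the sketch assumes that `bbox` values are pixel coordinates in the original video and that durations use the `start_sec`/`end_sec` keys.

```python
import json

# Excerpt mirroring the first clip entry above, without the // comments.
METADATA = """
{
    "clips": {
        "0-5BrmyFsYM_0": {
            "ytb_id": "0-5BrmyFsYM",
            "duration": {"start_sec": 0.0, "end_sec": 9.64},
            "bbox": {"top": 0, "bottom": 937, "left": 849, "right": 1872},
            "version": "v0.1"
        }
    }
}
"""


def summarize_clip(clip):
    """Hypothetical helper: derive clip length and an ffmpeg-style crop string."""
    dur = clip["duration"]
    box = clip["bbox"]
    # Assumption: bbox values are pixel coordinates in the source video.
    width = box["right"] - box["left"]
    height = box["bottom"] - box["top"]
    return {
        "ytb_id": clip["ytb_id"],
        "seconds": round(dur["end_sec"] - dur["start_sec"], 2),
        # Same argument order as ffmpeg's crop filter: width:height:x:y.
        "crop_filter": f"crop={width}:{height}:{box['left']}:{box['top']}",
    }


clips = json.loads(METADATA)["clips"]
summaries = {name: summarize_clip(clip) for name, clip in clips.items()}
for name, summary in summaries.items():
    print(name, summary)
```

The clip key appears to combine the YouTube id with a clip index, so several clips can come from the same source video; the sketch keeps the key as-is rather than parsing it.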

Benchmark on Facial Text-to-Video Generation

(1) Baselines

To train the baselines, we used their original implementations in our paper:

(2) Pretrained Models

| Text Descriptions (MMVID) | Link |
| --- | --- |
| VQGAN | Google Drive |
| general & detailed face attributes | Google Drive |
| emotion | Google Drive |
| action | Google Drive |
| light direction | Google Drive |
| light intensity & color temperature | Google Drive |
| general face attributes + emotion + action + light direction | Google Drive |

More Work May Interest You

Several of our previous publications might be of interest to you.

  • Face Generation:

    • (ECCV 2022) CelebV-HQ: A Large-scale Video Facial Attributes Dataset. Zhu et al. [Paper], [Project Page], [Dataset]
    • (CVPR 2022) TransEditor: Transformer-Based Dual-Space GAN for Highly Controllable Facial Editing. Xu et al. [Paper], [Project Page], [Code]
  • Human Generation:

    • (Tech. Report 2022) 3DHumanGAN: Towards Photo-realistic 3D-Aware Human Image Generation. Yang et al. [Paper], [Project Page], [Code]
    • (ECCV 2022) StyleGAN-Human: A Data-Centric Odyssey of Human. Fu et al. [Paper], [Project Page], [Dataset]
    • (SIGGRAPH 2022) Text2Human: Text-Driven Controllable Human Image Generation. Jiang et al. [Paper], [Project Page], [Code]

Citation

If you find this work useful for your research, please consider citing our paper:

@inproceedings{yu2022celebvtext,
  title={{CelebV-Text}: A Large-Scale Facial Text-Video Dataset},
  author={Yu, Jianhui and Zhu, Hao and Jiang, Liming and Loy, Chen Change and Cai, Weidong and Wu, Wayne},
  booktitle={CVPR},
  year={2023}
}

Acknowledgement

CelebV-Text is affiliated with OpenXDLab -- an open platform for X-Dimension high-quality data. This work is supported by NTU NAP, MOE AcRF Tier 1 (2021-T1-001-088).
