
CelebV-Text: A Large-Scale Facial Text-Video Dataset (CVPR 2023)

CelebV-Text: A Large-Scale Facial Text-Video Dataset
Jianhui Yu*, Hao Zhu*, Liming Jiang, Chen Change Loy, Weidong Cai, and Wayne Wu
(*Equal contribution)
Demo Video | Project Page | Paper (arxiv)

Text-driven generation models are currently flourishing in video editing thanks to their compelling results. For face-centric text-to-video generation, however, severe challenges remain because no suitable dataset with high-quality videos and highly relevant texts exists. In this work, we present a large-scale, high-quality, and diverse facial text-video dataset, CelebV-Text, to facilitate research on facial text-to-video generation tasks. CelebV-Text contains 70,000 in-the-wild face video clips covering diverse visual content. Each video clip is paired with 20 texts generated by the proposed semi-automatic text generation strategy, which describes both static and dynamic attributes precisely. We conduct a comprehensive statistical analysis of the videos, texts, and text-video relevance of CelebV-Text, verifying its superiority over other datasets. We also conduct extensive self-evaluations to show the effectiveness and potential of CelebV-Text. Furthermore, a benchmark is constructed with representative methods to standardize the evaluation of the facial text-to-video generation task.

Updates

  • [04/01/2024]
    • Confusion about the annotation files is explained here.
  • [11/08/2023]
    • Audio files (67k) can now be downloaded; see issue.
  • [20/06/2023]
    • Videos can now be downloaded; see issue.
  • [28/03/2023]
    • Paper is now released here!
  • [01/01/2023]
    • Code of MMVID-interp is now released here.
    • Pretrained models of benchmarks are released here.
    • Data annotation file is now released here.
  • [28/12/2022]
    • The codebase and project page are created.
    • The download and processing tools for the dataset are released. Use them to construct your CelebV-Text!

TODO

  • Video download and processing tools.
  • Text descriptions.
  • Data annotations.
  • Code of MMVID-interp.
  • Automatic text generation tool and templates.
  • Pretrained models of benchmarks.

Dataset Statistics

aa.mp4

Distribution of each attribute. CelebV-Text contains 70,000 video clips with a total duration of around 279 hours. Each video is accompanied by 20 sentences describing 6 designed attributes: 40 general appearances, 5 detailed appearances, 6 light conditions, 37 actions, 8 emotions, and 6 light directions.
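
As a quick sanity check on the figures above, the following minimal sketch derives the average clip length and the total number of text descriptions. Only the three input numbers come from this README; the variable names are illustrative, and because the total duration is stated as "around 279 hours", the average is approximate.

```python
# Figures quoted in the statistics paragraph above.
NUM_CLIPS = 70_000
TOTAL_HOURS = 279        # stated as "around 279 hours", so treat as approximate
TEXTS_PER_CLIP = 20

# Average clip length in seconds and total number of text descriptions.
avg_clip_seconds = TOTAL_HOURS * 3600 / NUM_CLIPS
total_texts = NUM_CLIPS * TEXTS_PER_CLIP

print(f"average clip length: about {avg_clip_seconds:.1f} s")
print(f"total text descriptions: {total_texts:,}")
```

Under these numbers, a clip lasts roughly 14 seconds on average and the dataset holds 1.4 million sentences.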

video stats

text stats

text-video rel

Visual ChatGPT Demo

This is a toy example of applying a text-to-face model together with ChatGPT. In this demo, we use MMVID, trained directly on the proposed CelebV-Text dataset, to demonstrate CelebV-Text's potential for enabling visual GPT applications. In the future, more sophisticated methods are expected to yield better results.

complex_input_demo.mp4

Agreement

  • The CelebV-Text dataset is available for non-commercial research purposes only.
  • All videos of the CelebV-Text dataset are obtained from the Internet and are not the property of our institutions. Our institutions are not responsible for the content or the meaning of these videos.
  • You agree not to reproduce, duplicate, copy, sell, trade, resell or exploit for any commercial purposes, any portion of the videos and any portion of derived data.
  • You agree not to further copy, publish or distribute any portion of the CelebV-Text dataset, except that copies of the dataset may be made for internal use at a single site within the same organization.

Dataset Download

(1) Text Descriptions & Metadata Annotation

Description | Link
general & detailed face attributes | Google Drive
emotion | Google Drive
action | Google Drive
light direction | Google Drive
light intensity | Google Drive
light color temperature | Google Drive
*metadata annotation | Google Drive

(2) Video Download Pipeline

Prepare the environment & Run script:

# prepare the environment
pip install youtube_dl
pip install opencv-python

# you can change the download folder in the code
python download_and_process.py

JSON File Structure:
{
    "clips":
    {
        "0-5BrmyFsYM_0":  // clip 1 
        {
            "ytb_id": "0-5BrmyFsYM",                                        // youtube id
            "duration": {"start_sec": 0.0, "end_sec": 9.64},                // start and end times in the original video
            "bbox": {"top": 0, "bottom": 937, "left": 849, "right": 1872},  // bounding box
            "version": "v0.1"
        },
      
        "00-30GQl0TM_7":  // clip 2 
        {
            "ytb_id": "00-30GQl0TM",                                        // youtube id
            "duration": {"start_sec": 415.29, "end_sec": 420.88},           // start and end times in the original video
            "bbox": {"top": 0, "bottom": 1183, "left": 665, "right": 1956}, // bounding box
            "version": "v0.1"
        },
        "..."
        "..."

    }
}
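
To make the structure above concrete, here is a minimal sketch of reading one clip entry and deriving its length and crop region. It is not the repository's `download_and_process.py`. It assumes each entry uses the `start_sec`/`end_sec` keys of the first clip shown, and that `bbox` values are pixel coordinates in the original frame, which the integer values suggest but the README does not state. The helper names `summarize_clip` and `build_crop_command`, and the file names `raw.mp4` and `0-5BrmyFsYM_0.mp4`, are hypothetical. The ffmpeg command is only built as a list, never executed.

```python
import json

# Inline sample copying the first clip entry shown above. A real run would
# load the downloaded metadata annotation file instead.
SAMPLE = """
{
  "clips": {
    "0-5BrmyFsYM_0": {
      "ytb_id": "0-5BrmyFsYM",
      "duration": {"start_sec": 0.0, "end_sec": 9.64},
      "bbox": {"top": 0, "bottom": 937, "left": 849, "right": 1872},
      "version": "v0.1"
    }
  }
}
"""


def summarize_clip(clip_id, clip):
    """Derive clip length and crop size from one annotation entry."""
    dur, bbox = clip["duration"], clip["bbox"]
    return {
        "clip_id": clip_id,
        "url": "https://www.youtube.com/watch?v=" + clip["ytb_id"],
        "start_sec": dur["start_sec"],
        "end_sec": dur["end_sec"],
        "length_sec": round(dur["end_sec"] - dur["start_sec"], 2),
        # Assumption: bbox values are pixel coordinates in the source frame.
        "x": bbox["left"],
        "y": bbox["top"],
        "width": bbox["right"] - bbox["left"],
        "height": bbox["bottom"] - bbox["top"],
    }


def build_crop_command(info, src_path, dst_path):
    """Build (but do not run) an ffmpeg call that trims and crops the clip."""
    crop = f"crop={info['width']}:{info['height']}:{info['x']}:{info['y']}"
    return [
        "ffmpeg", "-i", src_path,
        "-ss", str(info["start_sec"]), "-to", str(info["end_sec"]),
        "-filter:v", crop,
        dst_path,
    ]


metadata = json.loads(SAMPLE)
summaries = [summarize_clip(cid, c) for cid, c in metadata["clips"].items()]
command = build_crop_command(summaries[0], "raw.mp4", "0-5BrmyFsYM_0.mp4")
print(summaries[0]["length_sec"], " ".join(command))
```

Keeping the command as a list makes it easy to inspect before passing it to `subprocess.run`, and avoids shell quoting issues with clip IDs that begin with a hyphen.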

Benchmark on Facial Text-to-Video Generation

(1) Baselines

To train the baselines in our paper, we used their original implementations:

(2) Pretrained Models

Text Descriptions (MMVID) | Link
VQGAN | Google Drive
general & detailed face attributes | Google Drive
emotion | Google Drive
action | Google Drive
light direction | Google Drive
light intensity & color temperature | Google Drive
general face attributes + emotion + action + light direction | Google Drive

More Work That May Interest You

Several of our previous publications might be of interest to you.

  • Face Generation:

    • (ECCV 2022) CelebV-HQ: A Large-scale Video Facial Attributes Dataset. Zhu et al. [Paper], [Project Page], [Dataset]
    • (CVPR 2022) TransEditor: Transformer-Based Dual-Space GAN for Highly Controllable Facial Editing. Xu et al. [Paper], [Project Page], [Code]
  • Human Generation:

    • (Tech. Report 2022) 3DHumanGAN: Towards Photo-realistic 3D-Aware Human Image Generation. Yang et al. [Paper], [Project Page], [Code]
    • (ECCV 2022) StyleGAN-Human: A Data-Centric Odyssey of Human. Fu et al. [Paper], [Project Page], [Dataset]
    • (SIGGRAPH 2022) Text2Human: Text-Driven Controllable Human Image Generation. Jiang et al. [Paper], [Project Page], [Code]

Citation

If you find this work useful for your research, please consider citing our paper:

@inproceedings{yu2022celebvtext,
  title={{CelebV-Text}: A Large-Scale Facial Text-Video Dataset},
  author={Yu, Jianhui and Zhu, Hao and Jiang, Liming and Loy, Chen Change and Cai, Weidong and Wu, Wayne},
  booktitle={CVPR},
  year={2023}
}

Acknowledgement

CelebV-Text is affiliated with OpenXDLab -- an open platform for X-Dimension high-quality data. This work is supported by NTU NAP, MOE AcRF Tier 1 (2021-T1-001-088).
