SpaceTimeGPT is a video description generation model capable of spatial and temporal reasoning. Given a video, eight frames are sampled and analyzed by the model. The output is a sentence describing the events that occurred in the video, generated autoregressively.
Video Encoder: TimeSformer
Text Decoder: GPT-2
The encoder and decoder are initialized using pretrained weights for video classification and sentence completion, respectively. Encoder-decoder cross attention is used to unify the visual and linguistic domains. The model is fine-tuned end-to-end on the video captioning task.
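For illustration, here is a minimal sketch of how such an encoder-decoder can be composed with the Hugging Face `transformers` library. The checkpoint names (`facebook/timesformer-base-finetuned-k600`, `gpt2`) are assumptions for the sketch, not necessarily the exact pretrained weights used here:

```python
from transformers import AutoTokenizer, VisionEncoderDecoderModel

# Assumed checkpoints -- the exact pretrained weights used by SpaceTimeGPT may differ.
ENCODER = "facebook/timesformer-base-finetuned-k600"  # video classification weights
DECODER = "gpt2"                                      # sentence completion weights

# Cross-attention layers are added to the decoder so it can attend to video features.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(ENCODER, DECODER)

# GPT-2 has no pad token; reuse EOS so batched fine-tuning works.
tokenizer = AutoTokenizer.from_pretrained(DECODER)
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id
model.config.decoder_start_token_id = tokenizer.bos_token_id
```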
SpaceTimeGPT is trained on VATEX, a large video captioning dataset.
Performance: 67.3 CIDEr on the VATEX test split
Sampling method: 30
The easiest way to run inference with the model is through its Hugging Face checkpoint; a minimal example is sketched below.
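The sketch below uses `transformers` and PyAV to caption a local video. The checkpoint id `Neleac/SpaceTimeGPT`, the processor checkpoint `MCG-NJU/videomae-base`, the tokenizer checkpoint `gpt2`, and the video path are assumptions; consult the Hugging Face model card for the actual identifiers:

```python
import av
import numpy as np
import torch
from transformers import AutoImageProcessor, AutoTokenizer, VisionEncoderDecoderModel

# Assumed identifiers -- replace with the ones listed on the model card.
CHECKPOINT = "Neleac/SpaceTimeGPT"
VIDEO_PATH = "example.mp4"

device = "cuda" if torch.cuda.is_available() else "cpu"
image_processor = AutoImageProcessor.from_pretrained("MCG-NJU/videomae-base")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = VisionEncoderDecoderModel.from_pretrained(CHECKPOINT).to(device)

# Uniformly sample 8 frames from the video (assumes the clip has at least 8 frames).
container = av.open(VIDEO_PATH)
num_frames = container.streams.video[0].frames
indices = set(np.linspace(0, num_frames - 1, num=8).astype(int))
frames = [
    frame.to_ndarray(format="rgb24")
    for i, frame in enumerate(container.decode(video=0))
    if i in indices
]

# Encode the sampled frames and generate a caption autoregressively.
pixel_values = image_processor(frames, return_tensors="pt").pixel_values.to(device)
generated_ids = model.generate(pixel_values, max_length=60)
print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])
```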
Alternatively, you may create your own dataset and train the model using the files provided (a minimal training sketch follows the file listing):
- src
  - process_data - create Hugging Face Dataset using video and caption files
  - train - perform training
  - inference - perform inference
- dataset
  - vatex_{split} - VATEX {split} video IDs and captions
  - videoID_captions - all video IDs and captions
- result
  - generated_captions - model output on VATEX test split
  - vatex_test_cider - ground-truth captions of VATEX test split, in CIDEr format
- vatex_utils - utility scripts for acquiring and processing VATEX data
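As a rough illustration of end-to-end fine-tuning, the sketch below uses `Seq2SeqTrainer`. It does not reproduce the provided train script: the dataset class is a hypothetical stand-in for the output of process_data, and the checkpoint names and hyperparameters are placeholders:

```python
import torch
from torch.utils.data import Dataset
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments, VisionEncoderDecoderModel

class ToyVideoCaptionDataset(Dataset):
    """Hypothetical stand-in for the dataset built by process_data:
    each item holds 8 preprocessed frames and a tokenized caption."""
    def __len__(self):
        return 16

    def __getitem__(self, idx):
        return {
            "pixel_values": torch.randn(8, 3, 224, 224),  # 8 frames, RGB, 224x224
            "labels": torch.randint(0, 50257, (20,)),     # GPT-2 token IDs
        }

# Compose the encoder-decoder as in the sketch above (assumed checkpoints).
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "facebook/timesformer-base-finetuned-k600", "gpt2"
)
model.config.decoder_start_token_id = model.decoder.config.bos_token_id
model.config.pad_token_id = model.decoder.config.eos_token_id

# Placeholder hyperparameters for the sketch.
args = Seq2SeqTrainingArguments(
    output_dir="spacetimegpt-finetune",
    per_device_train_batch_size=2,
    learning_rate=5e-5,
    num_train_epochs=1,
    fp16=torch.cuda.is_available(),
)

trainer = Seq2SeqTrainer(model=model, args=args, train_dataset=ToyVideoCaptionDataset())
trainer.train()
```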