SpaceTimeGPT is a video description generation model capable of spatial and temporal reasoning. Given a video, eight frames are sampled and analyzed by the model. The output is a sentence describing the events that occurred in the video, generated autoregressively.
Video Encoder: TimeSformer
Text Decoder: GPT-2
The encoder and decoder are initialized using pretrained weights for video classification and sentence completion, respectively. Encoder-decoder cross attention is used to unify the visual and linguistic domains. The model is fine-tuned end-to-end on the video captioning task.
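For illustration, here is a minimal sketch of how such an encoder-decoder can be composed with the Hugging Face `transformers` library. The checkpoint names (`facebook/timesformer-base-finetuned-k600`, `gpt2`) are assumptions for the sketch, not necessarily the exact pretrained weights used here:

```python
from transformers import AutoTokenizer, VisionEncoderDecoderModel

# Assumed checkpoints -- the exact pretrained weights used by SpaceTimeGPT may differ.
ENCODER = "facebook/timesformer-base-finetuned-k600"  # video classification weights
DECODER = "gpt2"                                      # sentence completion weights

# Cross-attention layers are added to the decoder so it can attend to video features.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(ENCODER, DECODER)

# GPT-2 has no pad token; reuse EOS so batched fine-tuning works.
tokenizer = AutoTokenizer.from_pretrained(DECODER)
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id
model.config.decoder_start_token_id = tokenizer.bos_token_id
```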
SpaceTimeGPT is trained on VATEX, a large video captioning dataset.
Performance: 67.3 CIDEr on the VATEX test split
Sampling method: 30
The easiest way to run inference with the model is through its Hugging Face checkpoint; a minimal example is sketched below.
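The sketch below uses `transformers` and PyAV to caption a local video. The checkpoint id `Neleac/SpaceTimeGPT`, the processor checkpoint `MCG-NJU/videomae-base`, the tokenizer checkpoint `gpt2`, and the video path are assumptions; consult the Hugging Face model card for the actual identifiers:

```python
import av
import numpy as np
import torch
from transformers import AutoImageProcessor, AutoTokenizer, VisionEncoderDecoderModel

# Assumed identifiers -- replace with the ones listed on the model card.
CHECKPOINT = "Neleac/SpaceTimeGPT"
VIDEO_PATH = "example.mp4"

device = "cuda" if torch.cuda.is_available() else "cpu"
image_processor = AutoImageProcessor.from_pretrained("MCG-NJU/videomae-base")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = VisionEncoderDecoderModel.from_pretrained(CHECKPOINT).to(device)

# Uniformly sample 8 frames from the video (assumes the clip has at least 8 frames).
container = av.open(VIDEO_PATH)
num_frames = container.streams.video[0].frames
indices = set(np.linspace(0, num_frames - 1, num=8).astype(int))
frames = [
    frame.to_ndarray(format="rgb24")
    for i, frame in enumerate(container.decode(video=0))
    if i in indices
]

# Encode the sampled frames and generate a caption autoregressively.
pixel_values = image_processor(frames, return_tensors="pt").pixel_values.to(device)
generated_ids = model.generate(pixel_values, max_length=60)
print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])
```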
Alternatively, you may create your own dataset and train the model using the files provided (a minimal training sketch follows the file listing):
- src
  - process_data - create Hugging Face Dataset using video and caption files
  - train - perform training
  - inference - perform inference
- dataset
  - vatex_{split} - VATEX {split} video IDs and captions
  - videoID_captions - all video IDs and captions
- result
  - generated_captions - model output on VATEX test split
  - vatex_test_cider - ground-truth captions of VATEX test split, in CIDEr format
- vatex_utils - utility scripts for acquiring and processing VATEX data
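As a rough illustration of end-to-end fine-tuning, the sketch below uses `Seq2SeqTrainer`. It does not reproduce the provided train script: the dataset class is a hypothetical stand-in for the output of process_data, and the checkpoint names and hyperparameters are placeholders:

```python
import torch
from torch.utils.data import Dataset
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments, VisionEncoderDecoderModel

class ToyVideoCaptionDataset(Dataset):
    """Hypothetical stand-in for the dataset built by process_data:
    each item holds 8 preprocessed frames and a tokenized caption."""
    def __len__(self):
        return 16

    def __getitem__(self, idx):
        return {
            "pixel_values": torch.randn(8, 3, 224, 224),  # 8 frames, RGB, 224x224
            "labels": torch.randint(0, 50257, (20,)),     # GPT-2 token IDs
        }

# Compose the encoder-decoder as in the sketch above (assumed checkpoints).
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "facebook/timesformer-base-finetuned-k600", "gpt2"
)
model.config.decoder_start_token_id = model.decoder.config.bos_token_id
model.config.pad_token_id = model.decoder.config.eos_token_id

# Placeholder hyperparameters for the sketch.
args = Seq2SeqTrainingArguments(
    output_dir="spacetimegpt-finetune",
    per_device_train_batch_size=2,
    learning_rate=5e-5,
    num_train_epochs=1,
    fp16=torch.cuda.is_available(),
)

trainer = Seq2SeqTrainer(model=model, args=args, train_dataset=ToyVideoCaptionDataset())
trainer.train()
```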