Change json dataset to a more manageable format #239

Closed
neph1 opened this issue Jan 23, 2025 · 1 comment

neph1 commented Jan 23, 2025

Feature request

I'm struggling to manage a dataset of only tens of files, and I can only imagine what it would be like with hundreds of samples. Keeping paths and prompts separate and unnumbered makes it difficult to verify that each sample has the right prompt, or to tell where one is missing.
I propose changing the JSON dataset format to something like:

[
  {
    "type":"video",
    "path":"some/path",
    "prompt":"description"
  }
]
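To illustrate why per-sample records help, here is a minimal Python sketch of a loader that reports exactly which sample is incomplete. The function name `load_records` and the choice of `path` and `prompt` as required keys are assumptions made for illustration, not part of the existing code.

```python
import json
from typing import Any

# Assumption: "path" and "prompt" are required, "type" is optional.
REQUIRED_KEYS = ("path", "prompt")


def load_records(text: str) -> list[dict[str, Any]]:
    """Parse a JSON array of sample records and check each one.

    Raises ValueError naming the index of the first incomplete sample.
    """
    records = json.loads(text)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")
    for index, record in enumerate(records):
        if not isinstance(record, dict):
            raise ValueError(f"sample {index}: expected an object")
        # A key that is absent or empty counts as missing.
        missing = [key for key in REQUIRED_KEYS if not record.get(key)]
        if missing:
            raise ValueError(f"sample {index}: missing {missing}")
    return records
```

Because each prompt lives next to its path, a missing prompt surfaces as an error on one specific record rather than as a silent off-by-one across two lists.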

Motivation

This also makes it easy to add more information later on, if only for housekeeping, such as "fps", "frames", "width", etc.
The array could even be wrapped in an additional layer, in case a dataset needs other information too.
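As a rough sketch of the wrapping idea, a reader could accept both a bare array and a wrapped object. The key name `samples` and the function `unwrap_dataset` are hypothetical, chosen only for this example.

```python
import json
from typing import Any


def unwrap_dataset(text: str) -> tuple[dict[str, Any], list[Any]]:
    """Return (dataset-level metadata, samples) for either layout.

    Assumption: a wrapped dataset stores its records under "samples".
    """
    data = json.loads(text)
    if isinstance(data, list):
        # Bare array: no dataset-level metadata.
        return {}, data
    if isinstance(data, dict) and isinstance(data.get("samples"), list):
        metadata = {key: value for key, value in data.items() if key != "samples"}
        return metadata, data["samples"]
    raise ValueError("expected a JSON array or an object with a 'samples' array")
```

Accepting both layouts would keep plain arrays working while leaving room for dataset-wide fields.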

Your contribution

If the proposal is accepted, I can make a PR for this.

@a-r-r-o-w
Owner

Hey, JSON is already supported :) The code has a large backlog waiting on a refactor, but it works decently well for small datasets.

def _load_dataset_from_json(self) -> Tuple[List[str], List[str]]:

You need to pass --dataset_root as the directory where the videos are located, --dataset_file as the JSON file, --video_column as the name of the attribute that contains the path (the path must be relative to dataset_root), and --caption_column as the name of the attribute that contains the prompt. Let me know if something doesn't work, and please feel free to modify the code and submit PRs to make it better suited for general use.
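For readers who want to see how those four options fit together, here is a minimal Python sketch. It is not the body of `_load_dataset_from_json`; the function name `load_columns`, the return order, and the assumption that the JSON file is a top-level array opened from the given path as-is are all illustrative guesses.

```python
import json
from pathlib import Path


def load_columns(dataset_root, dataset_file, video_column, caption_column):
    """Read prompts and video paths from a JSON array of records.

    Assumptions: dataset_file is opened exactly as given, and each video
    path in the file is relative to dataset_root.
    """
    root = Path(dataset_root)
    with open(dataset_file, encoding="utf-8") as handle:
        records = json.load(handle)
    prompts = [record[caption_column] for record in records]
    # Resolve each relative video path against the dataset root.
    video_paths = [root / record[video_column] for record in records]
    return prompts, video_paths
```

With the format proposed in this issue, that would mean passing `path` as the video column and `prompt` as the caption column.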
