
Add option to specify config.json instead of individual training parameters in script #193

Open
a-r-r-o-w opened this issue Jan 7, 2025 · 1 comment

@a-r-r-o-w
Owner

Currently, we are very limited in what we can do when launching the training script. Some of the things that come to mind are:

  • Validation. What even are we doing with the `prompt-1@@@resolution1:::prompt-2@@@resolution2` format 😆 We need more controllability here: as we set our sights on more involved training algorithms, better validation control goes a long way.

So, something like:

{
  ...
  "validation_args": [
    {
      "prompt": prompt1,
      "height": height1,
      "width": width1,
      "guidance_scale": guidance_scale1,
      ...
    }
  ],
  ...
}

We should also allow specifying a CSV/text file for validation directly. One or more files can be given, and their contents should be chained together (similar to "Chaining datasets" below). This is just for a convenient experience when doing larger training runs.
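A minimal sketch of how such a config could be loaded and normalized (the function names and the default values here are assumptions; only the `validation_args` layout comes from the example above):

```python
import json

def parse_validation_args(config):
    """Expand the "validation_args" list from a parsed config dict,
    filling in defaults for fields a run omits.

    The default values below are placeholders, not project defaults.
    """
    defaults = {"height": 512, "width": 512, "guidance_scale": 5.0}
    runs = []
    for entry in config.get("validation_args", []):
        if "prompt" not in entry:
            raise ValueError("each validation entry needs a 'prompt'")
        runs.append({**defaults, **entry})
    return runs

def load_validation_args(path):
    """Read a JSON config file and return per-run validation settings."""
    with open(path) as f:
        return parse_validation_args(json.load(f))
```

Each returned dict can then be passed directly as keyword arguments to the validation pipeline call, so per-run overrides come for free.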

  • Different learning rates for different layers. From past training experiences with other trainers for image models, I've found that using a higher learning rate in earlier blocks and lower learning rates in later blocks is effective for training LoRAs quickly. We should allow regex-based setting of learning rates. In the JSON, I envision something like this:
{
  ...
  "optimizer_args": {
    "target_modules": [
      ["transformer_blocks\\.[0-9]\\.attn1", 1e-5],
      ["transformer_blocks\\.(1[0-9]|20)\\.attn1", 5e-6]
    ]
  }
}
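Turning such (pattern, learning-rate) pairs into optimizer parameter groups could look roughly like this (a sketch; `build_param_groups` is a hypothetical helper, and "first matching pattern wins" is one possible policy, not a decided design):

```python
import re

def build_param_groups(named_params, target_modules, default_lr):
    """Assign learning rates to parameters by regex on their names.

    `named_params` is an iterable of (name, param) pairs, e.g. from
    `model.named_parameters()`. `target_modules` is a list of
    (pattern, lr) pairs as in the JSON sketch above; the first
    matching pattern wins, and unmatched params fall back to
    `default_lr`. The result is a list of dicts in the shape
    torch optimizers accept for per-group options.
    """
    groups = {}
    for name, param in named_params:
        lr = default_lr
        for pattern, pattern_lr in target_modules:
            if re.search(pattern, name):
                lr = pattern_lr
                break
        groups.setdefault(lr, []).append(param)
    return [{"params": params, "lr": lr} for lr, params in groups.items()]
```

The returned list would be passed straight to the optimizer constructor, e.g. `torch.optim.AdamW(build_param_groups(...))`.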
  • Chaining datasets. In general, the dataset code needs a big refactor and should support many more data formats; we should leverage HF datasets where appropriate too. It is too tightly coupled right now, so we can't apply different pre-processing to different datasets, which is usually needed when mixing data sources. Allowing multiple datasets saves users from pre-processing everything themselves beforehand (currently a limitation and a frustrating experience for me too).

    In my mind, this enables two things:

    • Being able to use the same dataset more than once, which makes multi-resolution training on the same data easier because we can specify different config options per dataset entry. For example, if we have a 1024x1536 video dataset and also want to train at the same aspect ratio but a lower resolution, we could simply list the dataset twice with different resolution settings.
    • Easily leverage all the existing HF Hub splits of large datasets without users being required to preprocess everything into, say, a single CSV.
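The chaining idea above could be sketched like this (the config keys and `load_dataset` callable are placeholders, not an existing API; the point is that each entry carries its own preprocessing options, so the same source can appear twice at different resolutions):

```python
from itertools import chain

def chained_samples(dataset_configs, load_dataset):
    """Iterate over several dataset configs as one sample stream.

    Each config names a data source plus its own options, so the
    same source can be listed twice with, e.g., different target
    resolutions. `load_dataset` is a stand-in for whatever loader
    the trainer ends up using.
    """
    def with_options(cfg):
        for sample in load_dataset(cfg["path"]):
            # Attach per-dataset preprocessing options to each sample.
            yield {**sample, "height": cfg["height"], "width": cfg["width"]}

    return chain.from_iterable(with_options(cfg) for cfg in dataset_configs)
```

In a real implementation this would likely wrap `torch.utils.data.ChainDataset` or HF `datasets` concatenation rather than a bare generator, but the config-driven shape would be the same.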

It would be difficult and verbose to implement all this with CLI options, so a JSON file with nested and varying levels of control for each of the above would be nice to have.

Any other suggestions for improvements are welcome. This is in preparation for distillation trainers, which will undoubtedly require larger datasets and more control over these kinds of parameters. Before that, we also need stable FSDP/Tensor/Pipeline-parallel support so that larger models like HunyuanVideo can be trained without OOMs. I think DDP training is now stable, and I haven't encountered any major errors yet, so we can start refactoring the parallelization-related aspects. Will take these up one-by-one in small steps over the coming weeks.

@a-r-r-o-w a-r-r-o-w self-assigned this Jan 7, 2025
@sayakpaul
Collaborator

> Chaining datasets. In general, the dataset part needs a big refactor and needs to support a lot more data formats. We should leverage HF datasets where required too.

For this, we could think of leveraging webdataset for its proven scalability and widespread adoption in the community, especially for images and videos.

I think the configurability aspects and the parallelization aspects can be worked on independently. I can give the former a try as I had previously envisioned using a YAML file for this.

I am still a little unsure about what you mean by "chaining" datasets. Perhaps a simple example would be helpful.
