Currently, we are very limited in what we can do when launching the training script. Some of the things that come to mind are:
Validation. What are we even doing with `prompt-1@@@resolution1:::prompt-2@@@resolution2`? 😆 We need more controllability here as we set our sights on more involved training algorithms, and better validation control goes a long way.
We should also allow specifying a CSV/text file for validation directly. It could be one or more files, with everything chained together (similar to "Chaining datasets" below). This is just for a more convenient experience when doing larger training runs.
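As a hypothetical sketch of what more controllable validation could look like in such a config file (all key names below are placeholders for illustration, not existing options):

```json
{
  "validation": {
    "every_n_steps": 500,
    "prompt_files": ["validation_prompts_1.csv", "validation_prompts_2.csv"],
    "items": [
      { "prompt": "prompt-1", "resolution": "512x768" },
      { "prompt": "prompt-2", "resolution": "1024x1536" }
    ]
  }
}
```

The `prompt_files` entries would be chained together in the same way as the datasets below.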
Different learning rates for different layers. From past training experience with other trainers for image models, I've found that using a higher learning rate in earlier blocks and a lower learning rate in later blocks seems effective for training LoRAs quickly. We should allow regex-based setting of learning rates. In the JSON, I envision something like this:
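As a rough, hypothetical sketch (the `lr_groups` key, the pattern syntax, and the values are placeholders I'm assuming for illustration, not an existing option):

```json
{
  "optimizer": {
    "lr": 1e-4,
    "lr_groups": [
      { "pattern": "transformer_blocks\\.([0-9])\\.", "lr": 3e-4 },
      { "pattern": "transformer_blocks\\.(3[0-9])\\.", "lr": 5e-5 }
    ]
  }
}
```

Parameters matching a pattern would take that group's learning rate; anything unmatched would fall back to the top-level `lr`.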
Chaining datasets. In general, the dataset part needs a big refactor and should support many more data formats. We should also leverage HF `datasets` where appropriate. The current code is too tightly coupled, and we can't apply different pre-processing to different data, which is often needed when working with multiple datasets. If we allow using different datasets, it saves users from having to pre-process everything themselves beforehand (which is currently a limitation and a frustrating experience for me too).
In my mind, this enables two things (see the config sketch after this list):
Being able to use the same dataset more than once to allow multi-resolution training more easily (on the same data), because we can specify different config options for different datasets. For example, if we have a 1024x1536 resolution video dataset, we may also want to train at a lower resolution with the same aspect ratio.
Easily leverage all the existing HF Hub splits of large datasets without users being required to preprocess everything into, say, a single CSV.
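As a rough illustration of both points, a chained `datasets` section could look something like the following; every key name here is an assumption about a possible schema, not an existing interface:

```json
{
  "datasets": [
    { "data_root": "my-videos/", "resolution": "1024x1536" },
    { "data_root": "my-videos/", "resolution": "512x768" },
    { "dataset_name": "hf-user/some-video-dataset", "split": "train", "resolution": "512x768" }
  ]
}
```

The first two entries reuse the same local data at different resolutions; the third pulls a split straight from the HF Hub without any manual preprocessing.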
It would be difficult and verbose to implement all this with CLI options, so a JSON file with nested and varying levels of control for each of the above would be nice to have.
Any other suggestions for improvements are welcome. This is in preparation for distillation trainers, which will undoubtedly require larger datasets for training and more control over these kinds of parameters. Before that, we also need stable FSDP/Tensor/Pipeline-parallel support so that larger models like HunyuanVideo can be trained without OOMs. I think DDP training is now stable (I haven't encountered any major errors yet), so we can start looking into refactoring the parallelization-related aspects. I'll take these up one by one in small steps over the coming weeks.
> Chaining datasets. In general, the dataset part needs a big refactor and needs to support a lot more data formats. We should leverage HF datasets where required too.
For this, we could think of leveraging webdataset for its proven scalability and widespread adoption in the community, especially for images and videos.
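Hypothetically, a webdataset-backed source could then be just another entry type in the same chained config (the `format` and `shards` keys are assumptions, not an existing interface; the brace range is the usual webdataset shard pattern):

```json
{
  "datasets": [
    { "format": "webdataset", "shards": "my-videos/shard-{000000..000127}.tar", "resolution": "512x768" }
  ]
}
```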
I think the configurability aspects and the parallelization aspects can be worked on independently. I can give the former a try as I had previously envisioned using a YAML file for this.
I am still a little unsure about what you mean by "chaining" datasets. Perhaps a simple example would be helpful.