[training] CogVideoX-I2V LoRA #9482
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
When will the SFT version of CogVideoX-5B T2V be available? |
We can add support for it soon. Changing the CogVideoX T2V LoRA script to full SFT should actually be very simple: you will have to remove all the LoRA-related parts and make the transformer require gradients, and instead of saving LoRA weights, you can save the full model with |
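A minimal sketch of the "make the transformer require gradients" step described above, assuming a `torch.nn.Module`-like object. The helper name `make_fully_trainable` and the `demo` function are hypothetical and not part of this PR; the saving step is left out because it is not shown in the thread.

```python
def make_fully_trainable(module):
    """Enable gradients on every parameter of a torch.nn.Module-like object.

    Hypothetical helper: in a LoRA script the base transformer is frozen and
    only adapter weights train; for full SFT the adapter setup is dropped and
    the whole module is unfrozen instead.
    Returns the number of trainable scalar parameters.
    """
    module.requires_grad_(True)
    return sum(p.numel() for p in module.parameters() if p.requires_grad)


def demo():
    # Imported lazily so the helper above can be read without PyTorch installed.
    import torch

    layer = torch.nn.Linear(4, 2)
    layer.requires_grad_(False)         # frozen, as a LoRA base model would be
    return make_fully_trainable(layer)  # 4 * 2 weights + 2 biases
```

The optimizer would then be built from `module.parameters()` rather than from the LoRA adapter parameters.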
Why didn't you run the following two lines of code after calculating the loss? |
Gradient accumulation is handled by |
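As a framework-free illustration of what gradient accumulation computes (not the code from this PR), the sketch below checks that averaging the gradients of equal-sized micro-batches reproduces the full-batch gradient. The scalar model, the data, and the function names are all hypothetical.

```python
def grad_mse(w, xs, ys):
    """Gradient w.r.t. scalar w of the batch-mean loss 0.5 * (w * x - y) ** 2."""
    return sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, accumulation_steps):
    """Accumulate micro-batch gradients, each scaled by 1 / accumulation_steps."""
    if len(xs) % accumulation_steps != 0:
        raise ValueError("batch size must be divisible by accumulation_steps")
    micro = len(xs) // accumulation_steps
    total = 0.0
    for step in range(accumulation_steps):
        lo, hi = step * micro, (step + 1) * micro
        # The optimizer would only step after the last micro-batch.
        total += grad_mse(w, xs[lo:hi], ys[lo:hi]) / accumulation_steps
    return total


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
full_batch = grad_mse(0.5, xs, ys)
accumulated = accumulated_grad(0.5, xs, ys, accumulation_steps=2)
```

Gathering the loss across processes, by contrast, only affects what gets logged and does not change these gradients.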
But |
I'm not sure, but I've seen other models' training scripts run these lines of code. (for distributed training) |
Hmm, I'm not too sure either what the case for distributed training should be. It seems like you might be right because I took a look at a few code bases and found this used too. Just for a sanity check, pinging @SunMarc here. Do we need an |
any update? |
As far as I know, the I am not sure if I am right, if there is a mistake please help me correct it. |
but if |
any update? |
We're working on a separate repository for memory-efficient and multiresolution finetuning of CogVideoX that will be open-sourced soon. This PR will probably not receive any further updates at the moment. I think |
Can the code in this PR be used as-is to fine-tune I2V, with results consistent with the official CogVideoX? |
I don't need multi-resolution or memory-efficient training. Is this code now the same as |
The changes for I2V LoRA are as follows (in SAT):
Apart from these, if I'm missing anything, please let me know or feel free to open a PR for improvements. Maybe I can merge this as it is for now, and we can work on the other improvements later (it will be released as a separate repository in the near future). Here are my training runs: wandb
Accelerate Config
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
enable_cpu_affinity: false
gpu_ids: 2,3
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
Launch script
#!/bin/bash
export TORCH_LOGS="+dynamo,recompiles,graph_breaks"
export TORCHDYNAMO_VERBOSE=1
export TORCH_NCCL_ENABLE_MONITORING=0
export WANDB_MODE="offline"
GPU_IDS="2,3"
LEARNING_RATES=("1e-3" "1e-4")
LR_SCHEDULES=("constant" "cosine_with_restarts")
OPTIMIZERS=("adam" "adamw")
EPOCHS=("30")
for learning_rate in "${LEARNING_RATES[@]}"; do
for lr_schedule in "${LR_SCHEDULES[@]}"; do
for optimizer in "${OPTIMIZERS[@]}"; do
for epochs in "${EPOCHS[@]}"; do
cache_dir="/raid/aryan/cogvideox-lora/"
output_dir="/raid/aryan/cogvideox-lora__optimizer_${optimizer}__epochs_${epochs}__lr-schedule_${lr_schedule}__learning-rate_${learning_rate}/"
tracker_name="cogvideox-lora__optimizer_${optimizer}__epochs_${epochs}__lr-schedule_${lr_schedule}__learning-rate_${learning_rate}"
cmd="accelerate launch --gpu_ids $GPU_IDS --config_file accelerate_configs/simple_uncompiled_v2.yaml examples/cogvideo/train_cogvideox_image_to_video_lora.py \
--pretrained_model_name_or_path /raid/aryan/CogVideoX-5b-I2V/ \
--cache_dir $cache_dir \
--instance_data_root /raid/aryan/video-dataset-disney/ \
--caption_column prompts.txt \
--video_column videos.txt \
--validation_prompt \"A black and white animated scene unfolds with an anthropomorphic goat surrounded by musical notes and symbols, suggesting a playful environment. Mickey Mouse appears, leaning forward in curiosity as the goat remains still. The goat then engages with Mickey, who bends down to converse or react. The dynamics shift as Mickey grabs the goat, potentially in surprise or playfulness, amidst a minimalistic background. The scene captures the evolving relationship between the two characters in a whimsical, animated setting, emphasizing their interactions and emotions:::A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance\" \
--validation_images \"/raid/aryan/dataset-cogvideox/videos/frames_1_00.png:::/raid/aryan/dataset-cogvideox/videos/frames_2_00.png\" \
--validation_prompt_separator ::: \
--num_validation_videos 1 \
--validation_epochs 10 \
--seed 42 \
--rank 64 \
--lora_alpha 64 \
--mixed_precision bf16 \
--output_dir $output_dir \
--height 480 --width 720 --fps 8 --max_num_frames 49 --skip_frames_start 0 --skip_frames_end 0 \
--train_batch_size 1 \
--num_train_epochs $epochs \
--checkpointing_steps 10000 \
--gradient_accumulation_steps 1 \
--learning_rate $learning_rate \
--lr_scheduler $lr_schedule \
--lr_warmup_steps 200 \
--lr_num_cycles 1 \
--enable_slicing \
--enable_tiling \
--gradient_checkpointing \
--optimizer $optimizer \
--adam_beta1 0.9 \
--adam_beta2 0.95 \
--max_grad_norm 1.0 \
--report_to wandb \
--nccl_timeout 1800"
echo "Running command: $cmd"
eval $cmd
echo -ne "-------------------- Finished executing script --------------------"
done
done
done
done |
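A small sketch of what the launch script above expands to, assuming the nested loops run in the order shown and that validation prompts and images are paired positionally after splitting on the `:::` separator (the pairing is an assumption about the training script). The function names are hypothetical.

```python
from itertools import product

LEARNING_RATES = ["1e-3", "1e-4"]
LR_SCHEDULES = ["constant", "cosine_with_restarts"]
OPTIMIZERS = ["adam", "adamw"]
EPOCHS = ["30"]


def sweep_run_names():
    """Enumerate tracker names in the same order as the nested bash loops."""
    return [
        f"cogvideox-lora__optimizer_{opt}__epochs_{epochs}"
        f"__lr-schedule_{sched}__learning-rate_{lr}"
        for lr, sched, opt, epochs in product(
            LEARNING_RATES, LR_SCHEDULES, OPTIMIZERS, EPOCHS
        )
    ]


def pair_validation_inputs(prompts, images, separator=":::"):
    """Split both arguments on the separator and pair them one-to-one."""
    prompt_list = prompts.split(separator)
    image_list = images.split(separator)
    if len(prompt_list) != len(image_list):
        raise ValueError("need exactly one validation image per prompt")
    return list(zip(prompt_list, image_list))


runs = sweep_run_names()
pairs = pair_validation_inputs("a goat:::a panda", "frames_1.png:::frames_2.png")
```

With the values in the script, the sweep launches 2 x 2 x 2 x 1 = 8 training runs one after another.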
@a-r-r-o-w Great work! I also want to fine-tune the VAE model. Are there any code pointers for doing that? |
I found a big problem, you missed this line of code |
I'm extremely sorry for the time this would have cost you! I actually have this fixed in the latest repository that we will be releasing for finetuning cogvideox (hopefully this week as only some final tests are remaining), but I completely forgot to push the latest changes here 🫠 Thank you so much for reporting this though! Would you be okay if I added you as a co-author when pushing the fix here?
We haven't looked into it yet, but since there have been many requests (on the original CogVideo repo), we might consider providing a script for finetuning the VAE soon too. cc @zRzRzRzRzRzRzR |
Thanks for your reply. @SHYuanBest informed me about this bug, and I would be delighted if he could also be listed as a co-author when you push the fix. |
It seems that deepspeed cannot run normally: |
Co-Authored-By: yuan-shenghai <[email protected]>
Co-Authored-By: Shenghai Yuan <[email protected]>
Nope, the only bug we had fixed was the |
Are you sure that |
DeepSpeed requires saving weights on every device; saving weights only on the main process would cause issues. |
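The rule stated above can be sketched as a small predicate. The helper name `should_save_checkpoint` is hypothetical; in an Accelerate-based script the two flags would plausibly come from `accelerator.is_main_process` and a comparison of `accelerator.distributed_type` against `DistributedType.DEEPSPEED`, which is an assumption about how the fix is wired in.

```python
def should_save_checkpoint(is_main_process: bool, uses_deepspeed: bool) -> bool:
    """Decide whether the current process should call the checkpoint-saving code.

    Under DeepSpeed every rank takes part in saving, so all processes must
    enter the save path; otherwise only the main process writes the checkpoint.
    """
    return uses_deepspeed or is_main_process


# Rank 1 of a DeepSpeed run must still save; rank 1 of a plain multi-GPU run must not.
deepspeed_worker = should_save_checkpoint(is_main_process=False, uses_deepspeed=True)
ddp_worker = should_save_checkpoint(is_main_process=False, uses_deepspeed=False)
```

Guarding the save only with an "is main process" check would skip the non-main ranks, which is the situation the comment above warns about.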
Oh, I see what you mean, sorry! Will fix |
Co-Authored-By: yuan-shenghai <[email protected]>
I have located the problem. I seriously suspect that there is a problem in the implementation of |
Sorry for the wait ! I see that you are going to merge this soon. You don't have to gather the loss before calling |
thanks!
Hi everyone. I just trained a LoRA for I2V thanks to all your work but now I'm having trouble using it for inference. |
You will have to install the latest diffusers release with |
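For reference, a hedged sketch of I2V LoRA inference with a recent diffusers release. The function name, the adapter name, the default base model ID, and the sampling values are assumptions for illustration, not settings confirmed in this thread; `lora_dir` is assumed to be the training `--output_dir` containing the saved LoRA weights.

```python
def run_i2v_lora_inference(
    lora_dir,
    image_path,
    prompt,
    base_model="THUDM/CogVideoX-5b-I2V",  # assumed base checkpoint
    output_path="output.mp4",
):
    """Load the I2V pipeline, attach a trained LoRA, and export one video."""
    # Heavy imports are kept local so that defining the function needs no GPU stack.
    import torch
    from diffusers import CogVideoXImageToVideoPipeline
    from diffusers.utils import export_to_video, load_image

    pipe = CogVideoXImageToVideoPipeline.from_pretrained(
        base_model, torch_dtype=torch.bfloat16
    )
    # Adapter name is arbitrary; it only labels the loaded LoRA.
    pipe.load_lora_weights(lora_dir, adapter_name="cogvideox-i2v-lora")
    pipe.to("cuda")

    image = load_image(image_path)
    frames = pipe(
        image=image,
        prompt=prompt,
        num_frames=49,       # matches --max_num_frames in the launch script
        guidance_scale=6.0,  # assumed value
    ).frames[0]
    export_to_video(frames, output_path, fps=8)  # matches --fps in the launch script
    return output_path
```

Calling it requires a CUDA GPU and downloads the base model on first use.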
May I ask about how much GPU memory is required to use this LoRA fine-tuning script? |
You could do it in less than 24 GB with batch_size=1. More details are available here: https://github.com/a-r-r-o-w/cogvideox-factory |
@a-r-r-o-w Hi, we just developed an identity-preserving text-to-video generation model, ConsisID (based on CogVideoX-5B), which can keep human identity consistent in the generated video. Can you help us integrate it into diffusers? Thanks. |
Of course, I would love to! I'm quite busy this week, but I will take a good look and start testing it this weekend. Thanks for the awesome work! |
@a-r-r-o-w hi arrow, if you need any help, just feel free to let me know |
@SHYuanBest Sorry, I was not able to find the time yet to try and integrate this. But I did try it out on ComfyUI and the results were very cool! If you could open a PR, we could help with reviews and try to integrate it faster. If that works, we can set up a communication channel on Slack with your team (cc @sayakpaul) |
Sure, I have created a PR here: #10140. I will update the code there as soon as possible. |
* update * update * update * update * update * add coauthor Co-Authored-By: yuan-shenghai <[email protected]> * add coauthor Co-Authored-By: Shenghai Yuan <[email protected]> * update Co-Authored-By: yuan-shenghai <[email protected]> * update --------- Co-authored-by: yuan-shenghai <[email protected]> Co-authored-by: Shenghai Yuan <[email protected]>
What does this PR do?
Image-to-Video LoRA finetuning for CogVideoX
bash script
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.