
Possible issue when resuming from checkpoint with fp8 #246

Closed
1 of 2 tasks
neph1 opened this issue Jan 25, 2025 · 5 comments

@neph1

neph1 commented Jan 25, 2025

System Info

I noticed that my LoRA was degrading significantly after resuming training. Checking the logs before and after resuming shows the following:

Session 1:

Training steps: 66%|██████▋ | 1995/3000 [6:02:27<2:37:04, 9.38s/it, step_loss=0.439, lr=6.93e-5]
Training steps: 66%|██████▋ | 1995/3000 [6:02:31<2:37:04, 9.38s/it, step_loss=0.427, lr=6.93e-5]
Training steps: 67%|██████▋ | 1996/3000 [6:02:32<3:54:34, 14.02s/it, step_loss=0.427, lr=6.93e-5]
Training steps: 67%|██████▋ | 1996/3000 [6:02:32<3:54:34, 14.02s/it, grad_norm=0.0788, step_loss=0.227, lr=6.92e-5]
Training steps: 67%|██████▋ | 1996/3000 [6:02:33<3:54:34, 14.02s/it, step_loss=0.0343, lr=6.92e-5]
Training steps: 67%|██████▋ | 1996/3000 [6:02:34<3:54:34, 14.02s/it, step_loss=0.222, lr=6.92e-5]
Training steps: 67%|██████▋ | 1997/3000 [6:02:34<2:52:54, 10.34s/it, step_loss=0.222, lr=6.92e-5]
Training steps: 67%|██████▋ | 1997/3000 [6:02:34<2:52:54, 10.34s/it, grad_norm=0.278, step_loss=0.581, lr=6.92e-5]
Training steps: 67%|██████▋ | 1997/3000 [6:02:35<2:52:54, 10.34s/it, step_loss=0.199, lr=6.92e-5]
Training steps: 67%|██████▋ | 1998/3000 [6:02:41<2:34:52, 9.27s/it, step_loss=0.199, lr=6.92e-5]
Training steps: 67%|██████▋ | 1998/3000 [6:02:41<2:34:52, 9.27s/it, grad_norm=0.188, step_loss=0.581, lr=6.91e-5]
01/18/2025 15:11:52 - INFO - finetrainers - Memory after epoch 666: {
"memory_allocated": 14.656,
"memory_reserved": 22.004,
"max_memory_allocated": 21.208,
"max_memory_reserved": 22.973
}

Training steps: 67%|██████▋ | 1998/3000 [6:03:00<2:34:52, 9.27s/it, step_loss=0.452, lr=6.91e-5]
Training steps: 67%|██████▋ | 1998/3000 [6:03:03<2:34:52, 9.27s/it, step_loss=0.621, lr=6.91e-5]
Training steps: 67%|██████▋ | 1999/3000 [6:03:05<3:48:38, 13.71s/it, step_loss=0.621, lr=6.91e-5]
Training steps: 67%|██████▋ | 1999/3000 [6:03:05<3:48:38, 13.71s/it, grad_norm=0.0859, step_loss=0.213, lr=6.9e-5]
Training steps: 67%|██████▋ | 1999/3000 [6:03:06<3:48:38, 13.71s/it, step_loss=0.137, lr=6.9e-5]
Training steps: 67%|██████▋ | 1999/3000 [6:03:06<3:48:38, 13.71s/it, step_loss=0.127, lr=6.9e-5]
Training steps: 67%|██████▋ | 2000/3000 [6:03:07<2:48:47, 10.13s/it, step_loss=0.127, lr=6.9e-5]

Session 2:

Training steps: 67%|██████▋ | 2000/3000 [00:00<?, ?it/s]
Training steps: 67%|██████▋ | 2000/3000 [00:20<?, ?it/s, step_loss=0.565, lr=6.9e-5]
Training steps: 67%|██████▋ | 2000/3000 [00:23<?, ?it/s, step_loss=0.931, lr=6.9e-5]
Training steps: 67%|██████▋ | 2001/3000 [00:26<7:15:36, 26.16s/it, step_loss=0.931, lr=6.9e-5]
Training steps: 67%|██████▋ | 2001/3000 [00:26<7:15:36, 26.16s/it, grad_norm=0.224, step_loss=0.603, lr=6.89e-5]
Training steps: 67%|██████▋ | 2001/3000 [00:26<7:15:36, 26.16s/it, step_loss=1.17, lr=6.89e-5]
Training steps: 67%|██████▋ | 2001/3000 [00:27<7:15:36, 26.16s/it, step_loss=1.15, lr=6.89e-5]
Training steps: 67%|██████▋ | 2002/3000 [00:28<3:18:17, 11.92s/it, step_loss=1.15, lr=6.89e-5]
Training steps: 67%|██████▋ | 2002/3000 [00:28<3:18:17, 11.92s/it, grad_norm=1.65, step_loss=1.15, lr=6.88e-5]
Training steps: 67%|██████▋ | 2002/3000 [00:29<3:18:17, 11.92s/it, step_loss=1.5, lr=6.88e-5]
Training steps: 67%|██████▋ | 2003/3000 [00:35<2:43:35, 9.84s/it, step_loss=1.5, lr=6.88e-5]
Training steps: 67%|██████▋ | 2003/3000 [00:35<2:43:35, 9.84s/it, grad_norm=2.62, step_loss=0.737, lr=6.88e-5]
01/18/2025 16:55:02 - INFO - finetrainers - Memory after epoch 1001: {
"memory_allocated": 14.656,
"memory_reserved": 22.695,
"max_memory_allocated": 21.195,
"max_memory_reserved": 22.695
}

Training steps: 67%|██████▋ | 2003/3000 [00:55<2:43:35, 9.84s/it, step_loss=0.549, lr=6.88e-5]
Training steps: 67%|██████▋ | 2003/3000 [00:59<2:43:35, 9.84s/it, step_loss=0.857, lr=6.88e-5]
Training steps: 67%|██████▋ | 2004/3000 [01:01<4:26:41, 16.07s/it, step_loss=0.857, lr=6.88e-5]
Training steps: 67%|██████▋ | 2004/3000 [01:01<4:26:41, 16.07s/it, grad_norm=0.154, step_loss=0.985, lr=6.87e-5]
Training steps: 67%|██████▋ | 2004/3000 [01:01<4:26:41, 16.07s/it, step_loss=1.29, lr=6.87e-5]
Training steps: 67%|██████▋ | 2004/3000 [01:02<4:26:41, 16.07s/it, step_loss=1.14, lr=6.87e-5]
Training steps: 67%|██████▋ | 2005/3000 [01:02<3:01:13, 10.93s/it, step_loss=1.14, lr=6.87e-5]
Training steps: 67%|██████▋ | 2005/3000 [01:02<3:01:13, 10.93s/it, grad_norm=2.07, step_loss=0.941, lr=6.86e-5]
Training steps: 67%|██████▋ | 2005/3000 [01:03<3:01:13, 10.93s/it, step_loss=1.2, lr=6.86e-5]
Training steps: 67%|██████▋ | 2006/3000 [01:10<2:40:43, 9.70s/it, step_loss=1.2, lr=6.86e-5]
Training steps: 67%|██████▋ | 2006/3000 [01:10<2:40:43, 9.70s/it, grad_norm=0.395, step_loss=0.964, lr=6.86e-5]
01/18/2025 16:55:37 - INFO - finetrainers - Memory after epoch 1002: {
"memory_allocated": 14.654,
"memory_reserved": 22.881,
"max_memory_allocated": 21.197,
"max_memory_reserved": 22.881
}

Note the loss difference. The lr scheduling looks as expected. I also see now that the epoch counter differs after resuming (epoch 666 before vs. 1001 after).

I confirmed this with another LoRA trained in fp8: when resuming without fp8, the loss continues as in the previous session; when resuming with fp8, it is wildly off.
I'm using the UI, so technically the issue could be there, but I've had no problems resuming before, and no problems with fp8 on its own.

If resuming from a checkpoint with fp8 has been tested before, I'll assume it's some issue on my part.
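For context, here is a minimal standalone sketch (assuming PyTorch 2.1+ with the torch.float8_e5m2 dtype) of how much precision a bf16 weight loses when round-tripped through float8_e5m2, the storage dtype set by layerwise_upcasting_storage_dtype in the config below. It only illustrates the numerics involved, not the finetrainers code path, and may or may not be related to the degradation:

import torch

# Fake bf16 LoRA-sized weights, for illustration only.
w_bf16 = torch.randn(128, 128, dtype=torch.bfloat16) * 0.02

# Round-trip through float8_e5m2 (5 exponent bits, 2 mantissa bits),
# i.e. the storage dtype used by layerwise upcasting in this config.
w_back = w_bf16.to(torch.float8_e5m2).to(torch.bfloat16)

rel_err = ((w_back - w_bf16).abs() / w_bf16.abs().clamp_min(1e-8)).mean()
print(f"mean relative error after the fp8 round-trip: {rel_err.item():.3f}")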

In case it's a config issue, here's the one I used:

accelerate_config: uncompiled_1.yaml
allow_tf32: true
batch_size: 4
beta1: 0.9
beta2: 0.95
caption_column: prompts.txt
caption_dropout_p: 0.05
caption_dropout_technique: empty
checkpointing_limit: 3
checkpointing_steps: 200
data_root: 'test_129'
dataloader_num_workers: 0
dataset_file: ''
diffusion_options: ''
enable_model_cpu_offload: ''
enable_slicing: true
enable_tiling: true
epsilon: 1e-8
gpu_ids: '0'
gradient_accumulation_steps: 3
gradient_checkpointing: true
id_token: afkx
image_resolution_buckets: 512x768 720x416 416x736 512x720 720x480 480x720
layerwise_upcasting_modules: transformer
layerwise_upcasting_skip_modules_pattern: patch_embed pos_embed x_embedder context_embedder
  ^proj_in$ ^proj_out$ norm
layerwise_upcasting_storage_dtype: float8_e5m2
lora_alpha: 256
lr: 0.0002
lr_num_cycles: 1
lr_scheduler: linear
lr_warmup_steps: 100
max_grad_norm: 1
model_name: ltx_video
nccl_timeout: 1800
num_validation_videos: 0
optimizer: adamw
output_dir: test_129
pin_memory: true
precompute_conditions: ''
pretrained_model_name_or_path: ltx-0.9.1
rank: 128
report_to: none
resume_from_checkpoint: test_129/checkpoint-2800
seed: 42
target_modules: to_q to_k to_v to_out.0
text_encoder_2_dtype: bf16
text_encoder_3_dtype: bf16
text_encoder_dtype: bf16
tracker_name: finetrainers
train_steps: 3000
training_type: lora
transformer_dtype: bf16
use_8bit_bnb: ''
vae_dtype: bf16
validation_epochs: 0
validation_prompt_separator: ':::'
validation_prompts: ''
validation_steps: 100
video_column: videos.txt
video_resolution_buckets: 129x640x360 129x448x256 129x360x192 129x720x416 129x576x448
  129x416x736 129x608x320 1x720x416 1x512x768 1x720x416 1x416x736 1x512x720 1x720x480
  1x480x720
weight_decay: 0.001
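For anyone unfamiliar with the layerwise_upcasting_* options above: the general technique keeps weights in a low-precision storage dtype and upcasts each layer to the compute dtype only for its forward pass. Below is a simplified plain-PyTorch sketch of that idea; the function name is made up for the example, and this is not the finetrainers/diffusers implementation (which also honors the skip_modules_pattern above for sensitive layers such as embeddings and norms):

import torch
import torch.nn as nn

def apply_fp8_storage_sketch(model: nn.Module,
                             storage_dtype=torch.float8_e5m2,
                             compute_dtype=torch.bfloat16):
    # Illustration only: store Linear weights in fp8 and upcast around each forward.
    for layer in model.modules():
        if isinstance(layer, nn.Linear):
            layer.to(storage_dtype)

            def upcast(mod, args):
                mod.to(compute_dtype)    # upcast just-in-time for the matmul
                return args

            def downcast(mod, args, output):
                mod.to(storage_dtype)    # back to fp8 storage after the forward
                return output

            layer.register_forward_pre_hook(upcast)
            layer.register_forward_hook(downcast)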

Information

  • The official example scripts
  • My own modified scripts

Reproduction

  1. Train a LoRA.
  2. Resume from checkpoint with fp8 (layerwise upcasting) enabled.
  3. Observe that the step loss looks as if it had been reset.

Expected behavior

Resuming from a checkpoint with fp8 should behave no differently from resuming without fp8.

@sayakpaul
Collaborator

Can you share your training commands?

@neph1
Author

neph1 commented Jan 31, 2025

I believe this was the command used on resume:
accelerate launch --config_file configs/uncompiled_1.yaml --gpu_ids 0 train.py --model_name ltx_video --pretrained_model_name_or_path models/ltx_video/ --text_encoder_dtype bf16 --text_encoder_2_dtype bf16 --text_encoder_3_dtype bf16 --vae_dtype bf16 --data_root data/test --video_column videos.txt --caption_column prompts.txt --id_token axtf --video_resolution_buckets 1x512x512 --image_resolution_buckets 512x512 --caption_dropout_p 0.05 --caption_dropout_technique empty --text_encoder_dtype bf16 --text_encoder_2_dtype bf16 --text_encoder_3_dtype bf16 --vae_dtype bf16 --transformer_dtype bf16 --dataloader_num_workers 0 --training_type lora --seed 425 --batch_size 28 --train_steps 3000 --rank 64 --lora_alpha 64 --target_modules to_q to_k to_v to_out.0 --gradient_accumulation_steps 1 --gradient_checkpointing --checkpointing_steps 200 --checkpointing_limit 3 --enable_slicing --enable_tiling --resume_from_checkpoint outputs/test/checkpoint-600 --optimizer adamw --lr 0.0002 --lr_scheduler linear --lr_warmup_steps 0 --lr_num_cycles 1 --beta1 0.9 --beta2 0.95 --weight_decay 0.001 --epsilon 1e-8 --max_grad_norm 1 --num_validation_videos 0 --validation_steps 10000 --tracker_name finetrainers --output_dir outputs/test/ --nccl_timeout 1800 --report_to none

I'm also no longer sure about my earlier comment that it works without fp8.

@sayakpaul
Collaborator

Sorry for the delay on my end.

Some further clarifications:

  • Were you able to get the expected results without FP8?
  • I didn't encounter any problems while resuming from checkpoints for both LoRA and non-LoRA cases.
  • How did you launch your first training command with FP8? I don't see any FP8 specific bits in your command in the first place.

If you could help us with a minimal reproducible snippet to debug this further, that would be very much appreciated. Thanks!

@neph1
Author

neph1 commented Feb 12, 2025

Sorry for the confusion. I've run a couple more tests, and despite my original claim, it seems to be an issue with the learning rate; I should have realized this earlier.
When you continue training, you can't change the total number of steps. Doing so messes up the lr scheduler, and not in a predictable way either, it seems.
I've verified this by resuming with the original number of steps: the lr then continues its curve.
I have not been able to tweak the lr to get good results when changing the total number of steps. So if you plan on continuing training, set the step count HIGH from the start.
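To illustrate why the total step count matters, here is a small hand-rolled sketch of a typical linear-with-warmup schedule (the exact finetrainers scheduler may differ in details), using the lr=0.0002, lr_warmup_steps=100 and train_steps=3000 values from the config above:

def linear_with_warmup(step, warmup=100, total=3000):
    # Ramp up over `warmup` steps, then decay linearly to 0 at `total`.
    if step < warmup:
        return step / max(1, warmup)
    return max(0.0, (total - step) / max(1, total - warmup))

base_lr = 2e-4
print(base_lr * linear_with_warmup(2000, total=3000))  # ~6.9e-5, matches the lr in the logs around step 2000
print(base_lr * linear_with_warmup(2000, total=5000))  # ~1.2e-4 if the total were changed on resume

Since the decayed lr at a given step depends on the total, changing train_steps on resume jumps the lr to a different point on the curve.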

Edit: I realize this doesn't match up with the original post, where the lr seems to continue correctly. I will investigate some more on my own and reopen if I have a clear repro case.

@neph1 neph1 closed this as completed Feb 12, 2025
@sayakpaul
Collaborator

True. We did a couple of recent full fine-tuning runs and obtained decent results:
https://huggingface.co/collections/finetrainers/video-effects-6798834eece6b7910c43870d

Perhaps you could give those settings a try?
