
Possible issue when resuming from checkpoint with fp8 #246

Closed
1 of 2 tasks
neph1 opened this issue Jan 25, 2025 · 5 comments

@neph1

neph1 commented Jan 25, 2025

System Info

I noticed that my LoRA was degrading significantly after resuming training. Checking the logs before and after resuming shows the following:

Session 1:

Training steps: 66%|██████▋ | 1995/3000 [6:02:27<2:37:04, 9.38s/it, step_loss=0.439, lr=6.93e-5]
Training steps: 66%|██████▋ | 1995/3000 [6:02:31<2:37:04, 9.38s/it, step_loss=0.427, lr=6.93e-5]
Training steps: 67%|██████▋ | 1996/3000 [6:02:32<3:54:34, 14.02s/it, step_loss=0.427, lr=6.93e-5]
Training steps: 67%|██████▋ | 1996/3000 [6:02:32<3:54:34, 14.02s/it, grad_norm=0.0788, step_loss=0.227, lr=6.92e-5]
Training steps: 67%|██████▋ | 1996/3000 [6:02:33<3:54:34, 14.02s/it, step_loss=0.0343, lr=6.92e-5]
Training steps: 67%|██████▋ | 1996/3000 [6:02:34<3:54:34, 14.02s/it, step_loss=0.222, lr=6.92e-5]
Training steps: 67%|██████▋ | 1997/3000 [6:02:34<2:52:54, 10.34s/it, step_loss=0.222, lr=6.92e-5]
Training steps: 67%|██████▋ | 1997/3000 [6:02:34<2:52:54, 10.34s/it, grad_norm=0.278, step_loss=0.581, lr=6.92e-5]
Training steps: 67%|██████▋ | 1997/3000 [6:02:35<2:52:54, 10.34s/it, step_loss=0.199, lr=6.92e-5]
Training steps: 67%|██████▋ | 1998/3000 [6:02:41<2:34:52, 9.27s/it, step_loss=0.199, lr=6.92e-5]
Training steps: 67%|██████▋ | 1998/3000 [6:02:41<2:34:52, 9.27s/it, grad_norm=0.188, step_loss=0.581, lr=6.91e-5]
01/18/2025 15:11:52 - INFO - finetrainers - Memory after epoch 666: {
"memory_allocated": 14.656,
"memory_reserved": 22.004,
"max_memory_allocated": 21.208,
"max_memory_reserved": 22.973
}

Training steps: 67%|██████▋ | 1998/3000 [6:03:00<2:34:52, 9.27s/it, step_loss=0.452, lr=6.91e-5]
Training steps: 67%|██████▋ | 1998/3000 [6:03:03<2:34:52, 9.27s/it, step_loss=0.621, lr=6.91e-5]
Training steps: 67%|██████▋ | 1999/3000 [6:03:05<3:48:38, 13.71s/it, step_loss=0.621, lr=6.91e-5]
Training steps: 67%|██████▋ | 1999/3000 [6:03:05<3:48:38, 13.71s/it, grad_norm=0.0859, step_loss=0.213, lr=6.9e-5]
Training steps: 67%|██████▋ | 1999/3000 [6:03:06<3:48:38, 13.71s/it, step_loss=0.137, lr=6.9e-5]
Training steps: 67%|██████▋ | 1999/3000 [6:03:06<3:48:38, 13.71s/it, step_loss=0.127, lr=6.9e-5]
Training steps: 67%|██████▋ | 2000/3000 [6:03:07<2:48:47, 10.13s/it, step_loss=0.127, lr=6.9e-5]

Session 2:

Training steps: 67%|██████▋ | 2000/3000 [00:00<?, ?it/s]
Training steps: 67%|██████▋ | 2000/3000 [00:20<?, ?it/s, step_loss=0.565, lr=6.9e-5]
Training steps: 67%|██████▋ | 2000/3000 [00:23<?, ?it/s, step_loss=0.931, lr=6.9e-5]
Training steps: 67%|██████▋ | 2001/3000 [00:26<7:15:36, 26.16s/it, step_loss=0.931, lr=6.9e-5]
Training steps: 67%|██████▋ | 2001/3000 [00:26<7:15:36, 26.16s/it, grad_norm=0.224, step_loss=0.603, lr=6.89e-5]
Training steps: 67%|██████▋ | 2001/3000 [00:26<7:15:36, 26.16s/it, step_loss=1.17, lr=6.89e-5]
Training steps: 67%|██████▋ | 2001/3000 [00:27<7:15:36, 26.16s/it, step_loss=1.15, lr=6.89e-5]
Training steps: 67%|██████▋ | 2002/3000 [00:28<3:18:17, 11.92s/it, step_loss=1.15, lr=6.89e-5]
Training steps: 67%|██████▋ | 2002/3000 [00:28<3:18:17, 11.92s/it, grad_norm=1.65, step_loss=1.15, lr=6.88e-5]
Training steps: 67%|██████▋ | 2002/3000 [00:29<3:18:17, 11.92s/it, step_loss=1.5, lr=6.88e-5]
Training steps: 67%|██████▋ | 2003/3000 [00:35<2:43:35, 9.84s/it, step_loss=1.5, lr=6.88e-5]
Training steps: 67%|██████▋ | 2003/3000 [00:35<2:43:35, 9.84s/it, grad_norm=2.62, step_loss=0.737, lr=6.88e-5]
01/18/2025 16:55:02 - INFO - finetrainers - Memory after epoch 1001: {
"memory_allocated": 14.656,
"memory_reserved": 22.695,
"max_memory_allocated": 21.195,
"max_memory_reserved": 22.695
}

Training steps: 67%|██████▋ | 2003/3000 [00:55<2:43:35, 9.84s/it, step_loss=0.549, lr=6.88e-5]
Training steps: 67%|██████▋ | 2003/3000 [00:59<2:43:35, 9.84s/it, step_loss=0.857, lr=6.88e-5]
Training steps: 67%|██████▋ | 2004/3000 [01:01<4:26:41, 16.07s/it, step_loss=0.857, lr=6.88e-5]
Training steps: 67%|██████▋ | 2004/3000 [01:01<4:26:41, 16.07s/it, grad_norm=0.154, step_loss=0.985, lr=6.87e-5]
Training steps: 67%|██████▋ | 2004/3000 [01:01<4:26:41, 16.07s/it, step_loss=1.29, lr=6.87e-5]
Training steps: 67%|██████▋ | 2004/3000 [01:02<4:26:41, 16.07s/it, step_loss=1.14, lr=6.87e-5]
Training steps: 67%|██████▋ | 2005/3000 [01:02<3:01:13, 10.93s/it, step_loss=1.14, lr=6.87e-5]
Training steps: 67%|██████▋ | 2005/3000 [01:02<3:01:13, 10.93s/it, grad_norm=2.07, step_loss=0.941, lr=6.86e-5]
Training steps: 67%|██████▋ | 2005/3000 [01:03<3:01:13, 10.93s/it, step_loss=1.2, lr=6.86e-5]
Training steps: 67%|██████▋ | 2006/3000 [01:10<2:40:43, 9.70s/it, step_loss=1.2, lr=6.86e-5]
Training steps: 67%|██████▋ | 2006/3000 [01:10<2:40:43, 9.70s/it, grad_norm=0.395, step_loss=0.964, lr=6.86e-5]
01/18/2025 16:55:37 - INFO - finetrainers - Memory after epoch 1002: {
"memory_allocated": 14.654,
"memory_reserved": 22.881,
"max_memory_allocated": 21.197,
"max_memory_reserved": 22.881
}

Note the loss difference. The lr scheduling looks as expected. I also see now that the epoch counter differs after resuming (epoch 666 before vs. 1001 after).

I confirmed this with another LoRA trained in fp8: when resuming without fp8, the loss continues as in the previous session; when resuming with fp8, it is wildly off.
I'm using the UI, so technically the issue could be there, but I've had no problems resuming before, and no problems with fp8 on its own.

If resuming from a checkpoint with fp8 has been tested before, I'll assume it's some issue on my part.
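For context, here is a minimal standalone sketch (assuming PyTorch 2.1+ with the torch.float8_e5m2 dtype) of how much precision a bf16 weight loses when round-tripped through float8_e5m2, the storage dtype set by layerwise_upcasting_storage_dtype in the config below. It only illustrates the numerics involved, not the finetrainers code path, and may or may not be related to the degradation:

import torch

# Fake bf16 LoRA-sized weights, for illustration only.
w_bf16 = torch.randn(128, 128, dtype=torch.bfloat16) * 0.02

# Round-trip through float8_e5m2 (5 exponent bits, 2 mantissa bits),
# i.e. the storage dtype used by layerwise upcasting in this config.
w_back = w_bf16.to(torch.float8_e5m2).to(torch.bfloat16)

rel_err = ((w_back - w_bf16).abs() / w_bf16.abs().clamp_min(1e-8)).mean()
print(f"mean relative error after the fp8 round-trip: {rel_err.item():.3f}")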

In case it's a config issue, here's the one I used:

accelerate_config: uncompiled_1.yaml
allow_tf32: true
batch_size: 4
beta1: 0.9
beta2: 0.95
caption_column: prompts.txt
caption_dropout_p: 0.05
caption_dropout_technique: empty
checkpointing_limit: 3
checkpointing_steps: 200
data_root: 'test_129'
dataloader_num_workers: 0
dataset_file: ''
diffusion_options: ''
enable_model_cpu_offload: ''
enable_slicing: true
enable_tiling: true
epsilon: 1e-8
gpu_ids: '0'
gradient_accumulation_steps: 3
gradient_checkpointing: true
id_token: afkx
image_resolution_buckets: 512x768 720x416 416x736 512x720 720x480 480x720
layerwise_upcasting_modules: transformer
layerwise_upcasting_skip_modules_pattern: patch_embed pos_embed x_embedder context_embedder
  ^proj_in$ ^proj_out$ norm
layerwise_upcasting_storage_dtype: float8_e5m2
lora_alpha: 256
lr: 0.0002
lr_num_cycles: 1
lr_scheduler: linear
lr_warmup_steps: 100
max_grad_norm: 1
model_name: ltx_video
nccl_timeout: 1800
num_validation_videos: 0
optimizer: adamw
output_dir: test_129
pin_memory: true
precompute_conditions: ''
pretrained_model_name_or_path: ltx-0.9.1
rank: 128
report_to: none
resume_from_checkpoint: test_129/checkpoint-2800
seed: 42
target_modules: to_q to_k to_v to_out.0
text_encoder_2_dtype: bf16
text_encoder_3_dtype: bf16
text_encoder_dtype: bf16
tracker_name: finetrainers
train_steps: 3000
training_type: lora
transformer_dtype: bf16
use_8bit_bnb: ''
vae_dtype: bf16
validation_epochs: 0
validation_prompt_separator: ':::'
validation_prompts: ''
validation_steps: 100
video_column: videos.txt
video_resolution_buckets: 129x640x360 129x448x256 129x360x192 129x720x416 129x576x448
  129x416x736 129x608x320 1x720x416 1x512x768 1x720x416 1x416x736 1x512x720 1x720x480
  1x480x720
weight_decay: 0.001
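For anyone unfamiliar with the layerwise_upcasting_* options above: the general technique keeps weights in a low-precision storage dtype and upcasts each layer to the compute dtype only for its forward pass. Below is a simplified plain-PyTorch sketch of that idea; the function name is made up for the example, and this is not the finetrainers/diffusers implementation (which also honors the skip_modules_pattern above for sensitive layers such as embeddings and norms):

import torch
import torch.nn as nn

def apply_fp8_storage_sketch(model: nn.Module,
                             storage_dtype=torch.float8_e5m2,
                             compute_dtype=torch.bfloat16):
    # Illustration only: store Linear weights in fp8 and upcast around each forward.
    for layer in model.modules():
        if isinstance(layer, nn.Linear):
            layer.to(storage_dtype)

            def upcast(mod, args):
                mod.to(compute_dtype)    # upcast just-in-time for the matmul
                return args

            def downcast(mod, args, output):
                mod.to(storage_dtype)    # back to fp8 storage after the forward
                return output

            layer.register_forward_pre_hook(upcast)
            layer.register_forward_hook(downcast)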

Information

  • The official example scripts
  • My own modified scripts

Reproduction

  1. Train a LoRA.
  2. Resume from checkpoint with fp8 (layerwise upcasting) enabled.
  3. Observe that the step loss looks as if it had been reset.

Expected behavior

Resuming from a checkpoint with fp8 should behave no differently from resuming without fp8.

@sayakpaul
Collaborator

Can you share your training commands?

@neph1
Author

neph1 commented Jan 31, 2025

I believe this was the command used on resume:
accelerate launch --config_file configs/uncompiled_1.yaml --gpu_ids 0 train.py --model_name ltx_video --pretrained_model_name_or_path models/ltx_video/ --text_encoder_dtype bf16 --text_encoder_2_dtype bf16 --text_encoder_3_dtype bf16 --vae_dtype bf16 --data_root data/test --video_column videos.txt --caption_column prompts.txt --id_token axtf --video_resolution_buckets 1x512x512 --image_resolution_buckets 512x512 --caption_dropout_p 0.05 --caption_dropout_technique empty --text_encoder_dtype bf16 --text_encoder_2_dtype bf16 --text_encoder_3_dtype bf16 --vae_dtype bf16 --transformer_dtype bf16 --dataloader_num_workers 0 --training_type lora --seed 425 --batch_size 28 --train_steps 3000 --rank 64 --lora_alpha 64 --target_modules to_q to_k to_v to_out.0 --gradient_accumulation_steps 1 --gradient_checkpointing --checkpointing_steps 200 --checkpointing_limit 3 --enable_slicing --enable_tiling --resume_from_checkpoint outputs/test/checkpoint-600 --optimizer adamw --lr 0.0002 --lr_scheduler linear --lr_warmup_steps 0 --lr_num_cycles 1 --beta1 0.9 --beta2 0.95 --weight_decay 0.001 --epsilon 1e-8 --max_grad_norm 1 --num_validation_videos 0 --validation_steps 10000 --tracker_name finetrainers --output_dir outputs/test/ --nccl_timeout 1800 --report_to none

I'm also no longer sure about my earlier comment that it works without fp8.

@sayakpaul
Collaborator

Sorry for the delay on my end.

Some further clarifications:

  • Were you able to get the expected results without FP8?
  • I didn't encounter any problems while resuming from checkpoints for both LoRA and non-LoRA cases.
  • How did you launch your first training command with FP8? I don't see any FP8 specific bits in your command in the first place.

If you could help us with a minimal reproducible snippet to debug this further, that would be very much appreciated. Thanks!

@neph1
Author

neph1 commented Feb 12, 2025

Sorry for the confusion. I've run a couple more tests, and despite my original claim, it seems to be an issue with the learning rate; I should have realized this earlier.
When you continue training, you can't change the total number of steps. Doing so messes up the lr scheduler, and not in a predictable way either, it seems.
I've verified this by resuming with the original number of steps: the lr then continues its curve.
I have not been able to tweak the lr to get good results when changing the total number of steps. So if you plan on continuing training, set the step count HIGH from the start.
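To illustrate why the total step count matters, here is a small hand-rolled sketch of a typical linear-with-warmup schedule (the exact finetrainers scheduler may differ in details), using the lr=0.0002, lr_warmup_steps=100 and train_steps=3000 values from the config above:

def linear_with_warmup(step, warmup=100, total=3000):
    # Ramp up over `warmup` steps, then decay linearly to 0 at `total`.
    if step < warmup:
        return step / max(1, warmup)
    return max(0.0, (total - step) / max(1, total - warmup))

base_lr = 2e-4
print(base_lr * linear_with_warmup(2000, total=3000))  # ~6.9e-5, matches the lr in the logs around step 2000
print(base_lr * linear_with_warmup(2000, total=5000))  # ~1.2e-4 if the total were changed on resume

Since the decayed lr at a given step depends on the total, changing train_steps on resume jumps the lr to a different point on the curve.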

Edit: I realize this doesn't match up with the original post, where the lr seems to continue correctly. I will investigate some more on my own and reopen if I have a clear repro case.

@neph1 neph1 closed this as completed Feb 12, 2025
@sayakpaul
Collaborator

True. We did a couple of recent full fine-tuning runs and obtained decent results:
https://huggingface.co/collections/finetrainers/video-effects-6798834eece6b7910c43870d

Perhaps you could give those settings a try?
