Possible issue when resuming from checkpoint with fp8 #246

Open · neph1 opened this issue Jan 25, 2025 · 0 comments

neph1 commented Jan 25, 2025

System Info

I noticed that my LoRA was degrading significantly after resuming training. Checking the logs before and after resuming shows this:

Session 1:

Training steps: 66%|██████▋ | 1995/3000 [6:02:27<2:37:04, 9.38s/it, step_loss=0.439, lr=6.93e-5]
Training steps: 66%|██████▋ | 1995/3000 [6:02:31<2:37:04, 9.38s/it, step_loss=0.427, lr=6.93e-5]
Training steps: 67%|██████▋ | 1996/3000 [6:02:32<3:54:34, 14.02s/it, step_loss=0.427, lr=6.93e-5]
Training steps: 67%|██████▋ | 1996/3000 [6:02:32<3:54:34, 14.02s/it, grad_norm=0.0788, step_loss=0.227, lr=6.92e-5]
Training steps: 67%|██████▋ | 1996/3000 [6:02:33<3:54:34, 14.02s/it, step_loss=0.0343, lr=6.92e-5]
Training steps: 67%|██████▋ | 1996/3000 [6:02:34<3:54:34, 14.02s/it, step_loss=0.222, lr=6.92e-5]
Training steps: 67%|██████▋ | 1997/3000 [6:02:34<2:52:54, 10.34s/it, step_loss=0.222, lr=6.92e-5]
Training steps: 67%|██████▋ | 1997/3000 [6:02:34<2:52:54, 10.34s/it, grad_norm=0.278, step_loss=0.581, lr=6.92e-5]
Training steps: 67%|██████▋ | 1997/3000 [6:02:35<2:52:54, 10.34s/it, step_loss=0.199, lr=6.92e-5]
Training steps: 67%|██████▋ | 1998/3000 [6:02:41<2:34:52, 9.27s/it, step_loss=0.199, lr=6.92e-5]
Training steps: 67%|██████▋ | 1998/3000 [6:02:41<2:34:52, 9.27s/it, grad_norm=0.188, step_loss=0.581, lr=6.91e-5]
01/18/2025 15:11:52 - INFO - finetrainers - Memory after epoch 666: {
"memory_allocated": 14.656,
"memory_reserved": 22.004,
"max_memory_allocated": 21.208,
"max_memory_reserved": 22.973
}

Training steps: 67%|██████▋ | 1998/3000 [6:03:00<2:34:52, 9.27s/it, step_loss=0.452, lr=6.91e-5]
Training steps: 67%|██████▋ | 1998/3000 [6:03:03<2:34:52, 9.27s/it, step_loss=0.621, lr=6.91e-5]
Training steps: 67%|██████▋ | 1999/3000 [6:03:05<3:48:38, 13.71s/it, step_loss=0.621, lr=6.91e-5]
Training steps: 67%|██████▋ | 1999/3000 [6:03:05<3:48:38, 13.71s/it, grad_norm=0.0859, step_loss=0.213, lr=6.9e-5]
Training steps: 67%|██████▋ | 1999/3000 [6:03:06<3:48:38, 13.71s/it, step_loss=0.137, lr=6.9e-5]
Training steps: 67%|██████▋ | 1999/3000 [6:03:06<3:48:38, 13.71s/it, step_loss=0.127, lr=6.9e-5]
Training steps: 67%|██████▋ | 2000/3000 [6:03:07<2:48:47, 10.13s/it, step_loss=0.127, lr=6.9e-5]

Session 2:

Training steps: 67%|██████▋ | 2000/3000 [00:00<?, ?it/s]
Training steps: 67%|██████▋ | 2000/3000 [00:20<?, ?it/s, step_loss=0.565, lr=6.9e-5]
Training steps: 67%|██████▋ | 2000/3000 [00:23<?, ?it/s, step_loss=0.931, lr=6.9e-5]
Training steps: 67%|██████▋ | 2001/3000 [00:26<7:15:36, 26.16s/it, step_loss=0.931, lr=6.9e-5]
Training steps: 67%|██████▋ | 2001/3000 [00:26<7:15:36, 26.16s/it, grad_norm=0.224, step_loss=0.603, lr=6.89e-5]
Training steps: 67%|██████▋ | 2001/3000 [00:26<7:15:36, 26.16s/it, step_loss=1.17, lr=6.89e-5]
Training steps: 67%|██████▋ | 2001/3000 [00:27<7:15:36, 26.16s/it, step_loss=1.15, lr=6.89e-5]
Training steps: 67%|██████▋ | 2002/3000 [00:28<3:18:17, 11.92s/it, step_loss=1.15, lr=6.89e-5]
Training steps: 67%|██████▋ | 2002/3000 [00:28<3:18:17, 11.92s/it, grad_norm=1.65, step_loss=1.15, lr=6.88e-5]
Training steps: 67%|██████▋ | 2002/3000 [00:29<3:18:17, 11.92s/it, step_loss=1.5, lr=6.88e-5]
Training steps: 67%|██████▋ | 2003/3000 [00:35<2:43:35, 9.84s/it, step_loss=1.5, lr=6.88e-5]
Training steps: 67%|██████▋ | 2003/3000 [00:35<2:43:35, 9.84s/it, grad_norm=2.62, step_loss=0.737, lr=6.88e-5]
01/18/2025 16:55:02 - INFO - finetrainers - Memory after epoch 1001: {
"memory_allocated": 14.656,
"memory_reserved": 22.695,
"max_memory_allocated": 21.195,
"max_memory_reserved": 22.695
}

Training steps: 67%|██████▋ | 2003/3000 [00:55<2:43:35, 9.84s/it, step_loss=0.549, lr=6.88e-5]
Training steps: 67%|██████▋ | 2003/3000 [00:59<2:43:35, 9.84s/it, step_loss=0.857, lr=6.88e-5]
Training steps: 67%|██████▋ | 2004/3000 [01:01<4:26:41, 16.07s/it, step_loss=0.857, lr=6.88e-5]
Training steps: 67%|██████▋ | 2004/3000 [01:01<4:26:41, 16.07s/it, grad_norm=0.154, step_loss=0.985, lr=6.87e-5]
Training steps: 67%|██████▋ | 2004/3000 [01:01<4:26:41, 16.07s/it, step_loss=1.29, lr=6.87e-5]
Training steps: 67%|██████▋ | 2004/3000 [01:02<4:26:41, 16.07s/it, step_loss=1.14, lr=6.87e-5]
Training steps: 67%|██████▋ | 2005/3000 [01:02<3:01:13, 10.93s/it, step_loss=1.14, lr=6.87e-5]
Training steps: 67%|██████▋ | 2005/3000 [01:02<3:01:13, 10.93s/it, grad_norm=2.07, step_loss=0.941, lr=6.86e-5]
Training steps: 67%|██████▋ | 2005/3000 [01:03<3:01:13, 10.93s/it, step_loss=1.2, lr=6.86e-5]
Training steps: 67%|██████▋ | 2006/3000 [01:10<2:40:43, 9.70s/it, step_loss=1.2, lr=6.86e-5]
Training steps: 67%|██████▋ | 2006/3000 [01:10<2:40:43, 9.70s/it, grad_norm=0.395, step_loss=0.964, lr=6.86e-5]
01/18/2025 16:55:37 - INFO - finetrainers - Memory after epoch 1002: {
"memory_allocated": 14.654,
"memory_reserved": 22.881,
"max_memory_allocated": 21.197,
"max_memory_reserved": 22.881
}

Note the loss difference. LR scheduling looks as expected. I also see that the epoch counter differs after resuming: the last epoch logged in session 1 is 666, while session 2 reports epoch 1001.
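
As a back-of-the-envelope check on those numbers (my own arithmetic from the logs above, not something the trainer reports):

```python
resume_step = 2000   # optimizer step at which training resumed
epoch_before = 666   # "Memory after epoch 666" at step ~1998 in session 1
epoch_after = 1001   # "Memory after epoch 1001" at step ~2003 in session 2

print(resume_step / epoch_before)  # ~3.0 optimizer steps per epoch in session 1
print(resume_step / epoch_after)   # ~2.0 steps per epoch implied after resume
```

A mismatch like this usually means the steps-per-epoch value is derived differently on resume (for instance, if gradient accumulation, 3 in my config, is folded in inconsistently), though I can't tell whether it's related to the loss jump.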

I confirmed this with another LoRA trained in fp8: when resuming without fp8, the loss continues as in the previous session; when resuming with fp8, it's wildly off.
I'm using the UI, so technically the problem could be there, but I've had no problems resuming before, and no problems with fp8 on its own.
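
For intuition on why fp8 storage alone could cause this, here is a minimal sketch in plain PyTorch (independent of finetrainers) of the precision lost when a bf16 weight round-trips through float8_e5m2 storage, which is what layerwise upcasting uses:

```python
import torch

# float8_e5m2 keeps only 2 mantissa bits, so a bf16 -> fp8 -> bf16
# round trip is visibly lossy.
w = torch.randn(128, 128, dtype=torch.bfloat16)
w_roundtrip = w.to(torch.float8_e5m2).to(torch.bfloat16)
print((w - w_roundtrip).abs().max())  # noticeable rounding error, not ~0
```

If restoring the checkpoint re-applies that cast to the trained weights (or applies it in a different order than during the first session), training would effectively resume from perturbed weights, which would match the loss jump above. This is only a guess, though.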

If resuming from a checkpoint with fp8 has been tested before, I'll assume it's some issue on my part.

In case it's a config issue, here's the one I used:

accelerate_config: uncompiled_1.yaml
allow_tf32: true
batch_size: 4
beta1: 0.9
beta2: 0.95
caption_column: prompts.txt
caption_dropout_p: 0.05
caption_dropout_technique: empty
checkpointing_limit: 3
checkpointing_steps: 200
data_root: 'test_129'
dataloader_num_workers: 0
dataset_file: ''
diffusion_options: ''
enable_model_cpu_offload: ''
enable_slicing: true
enable_tiling: true
epsilon: 1e-8
gpu_ids: '0'
gradient_accumulation_steps: 3
gradient_checkpointing: true
id_token: afkx
image_resolution_buckets: 512x768 720x416 416x736 512x720 720x480 480x720
layerwise_upcasting_modules: transformer
layerwise_upcasting_skip_modules_pattern: patch_embed pos_embed x_embedder context_embedder
  ^proj_in$ ^proj_out$ norm
layerwise_upcasting_storage_dtype: float8_e5m2
lora_alpha: 256
lr: 0.0002
lr_num_cycles: 1
lr_scheduler: linear
lr_warmup_steps: 100
max_grad_norm: 1
model_name: ltx_video
nccl_timeout: 1800
num_validation_videos: 0
optimizer: adamw
output_dir: test_129
pin_memory: true
precompute_conditions: ''
pretrained_model_name_or_path: ltx-0.9.1
rank: 128
report_to: none
resume_from_checkpoint: test_129/checkpoint-2800
seed: 42
target_modules: to_q to_k to_v to_out.0
text_encoder_2_dtype: bf16
text_encoder_3_dtype: bf16
text_encoder_dtype: bf16
tracker_name: finetrainers
train_steps: 3000
training_type: lora
transformer_dtype: bf16
use_8bit_bnb: ''
vae_dtype: bf16
validation_epochs: 0
validation_prompt_separator: ':::'
validation_prompts: ''
validation_steps: 100
video_column: videos.txt
video_resolution_buckets: 129x640x360 129x448x256 129x360x192 129x720x416 129x576x448
  129x416x736 129x608x320 1x720x416 1x512x768 1x720x416 1x416x736 1x512x720 1x720x480
  1x480x720
weight_decay: 0.001
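
For reference, my reading of how the layerwise-upcasting keys above fit together (a sketch of my understanding, not the actual finetrainers wiring):

```python
import re
import torch

storage_dtype = torch.float8_e5m2   # layerwise_upcasting_storage_dtype
compute_dtype = torch.bfloat16      # transformer_dtype
skip_patterns = [
    "patch_embed", "pos_embed", "x_embedder", "context_embedder",
    r"^proj_in$", r"^proj_out$", "norm",
]  # layerwise_upcasting_skip_modules_pattern

def keeps_high_precision(module_name: str) -> bool:
    # Modules matching any skip pattern stay in bf16 storage; everything
    # else in the transformer is stored as float8_e5m2 and upcast to the
    # compute dtype on the fly.
    return any(re.search(p, module_name) for p in skip_patterns)
```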

Information

  • The official example scripts
  • My own modified scripts

Reproduction

  1. Train a LoRA.
  2. Resume from the checkpoint with fp8 (layerwise_upcasting) enabled.
  3. Observe that the step loss looks as if it had been reset (a way to quantify any weight drift is sketched below).
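
To pin down whether the restored weights themselves are perturbed, a check along these lines might help (hypothetical helper, not a finetrainers API): load the LoRA state dict saved in the checkpoint and diff it against the weights the model actually holds right after resuming.

```python
import torch

def max_weight_drift(saved_state: dict[str, torch.Tensor],
                     live_state: dict[str, torch.Tensor]) -> float:
    # Largest absolute difference between checkpointed and resumed weights.
    drift = 0.0
    for name, saved in saved_state.items():
        live = live_state[name].to(saved.device)
        drift = max(drift, (saved.float() - live.float()).abs().max().item())
    return drift

# Expectation: ~0 when resuming without fp8; on the order of the e5m2
# rounding error if the restored weights get re-quantized on load.
```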

Expected behavior

Resuming from a checkpoint with fp8 should behave no differently from resuming without it.
