I noticed that my LoRA was degrading significantly after resuming training. Checking the logs before and after resuming shows this:
Session 1:
Training steps: 66%|██████▋ | 1995/3000 [6:02:27<2:37:04, 9.38s/it, step_loss=0.439, lr=6.93e-5]
Training steps: 66%|██████▋ | 1995/3000 [6:02:31<2:37:04, 9.38s/it, step_loss=0.427, lr=6.93e-5]
Training steps: 67%|██████▋ | 1996/3000 [6:02:32<3:54:34, 14.02s/it, step_loss=0.427, lr=6.93e-5]
Training steps: 67%|██████▋ | 1996/3000 [6:02:32<3:54:34, 14.02s/it, grad_norm=0.0788, step_loss=0.227, lr=6.92e-5]
Training steps: 67%|██████▋ | 1996/3000 [6:02:33<3:54:34, 14.02s/it, step_loss=0.0343, lr=6.92e-5]
Training steps: 67%|██████▋ | 1996/3000 [6:02:34<3:54:34, 14.02s/it, step_loss=0.222, lr=6.92e-5]
Training steps: 67%|██████▋ | 1997/3000 [6:02:34<2:52:54, 10.34s/it, step_loss=0.222, lr=6.92e-5]
Training steps: 67%|██████▋ | 1997/3000 [6:02:34<2:52:54, 10.34s/it, grad_norm=0.278, step_loss=0.581, lr=6.92e-5]
Training steps: 67%|██████▋ | 1997/3000 [6:02:35<2:52:54, 10.34s/it, step_loss=0.199, lr=6.92e-5]
Training steps: 67%|██████▋ | 1998/3000 [6:02:41<2:34:52, 9.27s/it, step_loss=0.199, lr=6.92e-5]
Training steps: 67%|██████▋ | 1998/3000 [6:02:41<2:34:52, 9.27s/it, grad_norm=0.188, step_loss=0.581, lr=6.91e-5]
01/18/2025 15:11:52 - INFO - finetrainers - Memory after epoch 666: {
"memory_allocated": 14.656,
"memory_reserved": 22.004,
"max_memory_allocated": 21.208,
"max_memory_reserved": 22.973
}
Training steps: 67%|██████▋ | 1998/3000 [6:03:00<2:34:52, 9.27s/it, step_loss=0.452, lr=6.91e-5]
Training steps: 67%|██████▋ | 1998/3000 [6:03:03<2:34:52, 9.27s/it, step_loss=0.621, lr=6.91e-5]
Training steps: 67%|██████▋ | 1999/3000 [6:03:05<3:48:38, 13.71s/it, step_loss=0.621, lr=6.91e-5]
Training steps: 67%|██████▋ | 1999/3000 [6:03:05<3:48:38, 13.71s/it, grad_norm=0.0859, step_loss=0.213, lr=6.9e-5]
Training steps: 67%|██████▋ | 1999/3000 [6:03:06<3:48:38, 13.71s/it, step_loss=0.137, lr=6.9e-5]
Training steps: 67%|██████▋ | 1999/3000 [6:03:06<3:48:38, 13.71s/it, step_loss=0.127, lr=6.9e-5]
Training steps: 67%|██████▋ | 2000/3000 [6:03:07<2:48:47, 10.13s/it, step_loss=0.127, lr=6.9e-5]
Session 2:
Training steps: 67%|██████▋ | 2000/3000 [00:00<?, ?it/s]
Training steps: 67%|██████▋ | 2000/3000 [00:20<?, ?it/s, step_loss=0.565, lr=6.9e-5]
Training steps: 67%|██████▋ | 2000/3000 [00:23<?, ?it/s, step_loss=0.931, lr=6.9e-5]
Training steps: 67%|██████▋ | 2001/3000 [00:26<7:15:36, 26.16s/it, step_loss=0.931, lr=6.9e-5]
Training steps: 67%|██████▋ | 2001/3000 [00:26<7:15:36, 26.16s/it, grad_norm=0.224, step_loss=0.603, lr=6.89e-5]
Training steps: 67%|██████▋ | 2001/3000 [00:26<7:15:36, 26.16s/it, step_loss=1.17, lr=6.89e-5]
Training steps: 67%|██████▋ | 2001/3000 [00:27<7:15:36, 26.16s/it, step_loss=1.15, lr=6.89e-5]
Training steps: 67%|██████▋ | 2002/3000 [00:28<3:18:17, 11.92s/it, step_loss=1.15, lr=6.89e-5]
Training steps: 67%|██████▋ | 2002/3000 [00:28<3:18:17, 11.92s/it, grad_norm=1.65, step_loss=1.15, lr=6.88e-5]
Training steps: 67%|██████▋ | 2002/3000 [00:29<3:18:17, 11.92s/it, step_loss=1.5, lr=6.88e-5]
Training steps: 67%|██████▋ | 2003/3000 [00:35<2:43:35, 9.84s/it, step_loss=1.5, lr=6.88e-5]
Training steps: 67%|██████▋ | 2003/3000 [00:35<2:43:35, 9.84s/it, grad_norm=2.62, step_loss=0.737, lr=6.88e-5]
01/18/2025 16:55:02 - INFO - finetrainers - Memory after epoch 1001: {
"memory_allocated": 14.656,
"memory_reserved": 22.695,
"max_memory_allocated": 21.195,
"max_memory_reserved": 22.695
}
Training steps: 67%|██████▋ | 2003/3000 [00:55<2:43:35, 9.84s/it, step_loss=0.549, lr=6.88e-5]
Training steps: 67%|██████▋ | 2003/3000 [00:59<2:43:35, 9.84s/it, step_loss=0.857, lr=6.88e-5]
Training steps: 67%|██████▋ | 2004/3000 [01:01<4:26:41, 16.07s/it, step_loss=0.857, lr=6.88e-5]
Training steps: 67%|██████▋ | 2004/3000 [01:01<4:26:41, 16.07s/it, grad_norm=0.154, step_loss=0.985, lr=6.87e-5]
Training steps: 67%|██████▋ | 2004/3000 [01:01<4:26:41, 16.07s/it, step_loss=1.29, lr=6.87e-5]
Training steps: 67%|██████▋ | 2004/3000 [01:02<4:26:41, 16.07s/it, step_loss=1.14, lr=6.87e-5]
Training steps: 67%|██████▋ | 2005/3000 [01:02<3:01:13, 10.93s/it, step_loss=1.14, lr=6.87e-5]
Training steps: 67%|██████▋ | 2005/3000 [01:02<3:01:13, 10.93s/it, grad_norm=2.07, step_loss=0.941, lr=6.86e-5]
Training steps: 67%|██████▋ | 2005/3000 [01:03<3:01:13, 10.93s/it, step_loss=1.2, lr=6.86e-5]
Training steps: 67%|██████▋ | 2006/3000 [01:10<2:40:43, 9.70s/it, step_loss=1.2, lr=6.86e-5]
Training steps: 67%|██████▋ | 2006/3000 [01:10<2:40:43, 9.70s/it, grad_norm=0.395, step_loss=0.964, lr=6.86e-5]
01/18/2025 16:55:37 - INFO - finetrainers - Memory after epoch 1002: {
"memory_allocated": 14.654,
"memory_reserved": 22.881,
"max_memory_allocated": 21.197,
"max_memory_reserved": 22.881
}
Note the loss difference. LR scheduling looks as expected. I also see now that the reported epoch differs after resuming: 666 before vs. 1001 after.
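For reference, here is the arithmetic on those counters, just reading the numbers off the logs above (not trainer internals):

```python
# Epoch/step arithmetic taken from the two session logs above.
# Session 1: "Memory after epoch 666" is logged around step 1998.
print(1998 / 666)                 # 3.0 -> one epoch per 3 optimizer steps before the resume
# Session 2: "Memory after epoch 1001" at step 2003, "epoch 1002" at step 2006.
print(2006 - 2003)                # 3  -> epochs still advance every 3 steps after the resume
print(1001 - (2003 - 2000) // 3)  # 1000 -> but the counter restarted near 1000 instead of ~667
```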
I confirmed this with another LoRA trained in fp8: when resuming without fp8, the loss continues as it did in the previous session; when resuming with fp8, it's wildly off.
I'm using the UI, so technically the issue could be there, but I've had no problems resuming before and no problems with fp8 on its own.
If resuming from a checkpoint with fp8 has been tested before, I'll assume it's some issue on my part.
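One check I can run on my side is to diff the LoRA tensors in the saved checkpoint against the weights the model actually holds right after resuming with fp8. This is only a sketch; the file names below are placeholders for wherever the UI writes the checkpoint and for a state dict dumped manually right after the resume:

```python
# Sketch: compare the LoRA weights saved at step 2000 with the LoRA weights present
# right after resuming with fp8 enabled. Both paths are placeholders, not the real layout.
from safetensors.torch import load_file

saved = load_file("checkpoint-2000/pytorch_lora_weights.safetensors")      # assumed file name
resumed = load_file("after_resume_dump/pytorch_lora_weights.safetensors")  # dumped manually after resume

for name, w in saved.items():
    diff = (w.float() - resumed[name].float()).abs().max().item()
    if diff > 1e-3:  # arbitrary threshold; an fp8 round-trip error should stand out well above it
        print(f"{name}: max abs diff {diff:.4f}")
```

If the tensors match, the degradation is more likely coming from the optimizer/scheduler state or from how the fp8 cast is applied after loading.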
In case it's a config issue, here's the one I used:
System Info / 系統信息
Information / 问题信息
Reproduction / 复现过程
Expected behavior / 期待表现
Resuming from a checkpoint with fp8 enabled should behave the same as resuming without fp8.
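A concrete way to state that: the average step_loss over a short window right after the resume should stay close to the window right before it. A rough parse of the two session logs above (assuming they are saved as plain text files with the tqdm line format shown there) makes the jump visible:

```python
# Rough continuity check: mean step_loss just before vs. just after the resume point.
# Assumes the logs above were saved as plain text; the line format is the tqdm output
# shown in the issue ("... | 1995/3000 [... step_loss=0.439, lr=...]").
import re

def mean_loss(path, lo, hi):
    losses = []
    for line in open(path):
        m = re.search(r"\| (\d+)/3000 .*step_loss=([\d.]+)", line)
        if m and lo <= int(m.group(1)) <= hi:
            losses.append(float(m.group(2)))
    return sum(losses) / max(len(losses), 1)

print(mean_loss("session1.log", 1995, 2000))  # ~0.33 from the lines above
print(mean_loss("session2.log", 2000, 2006))  # ~1.0 after the fp8 resume
```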