Wrong train_state.step when resuming from checkpoint for the second time #571

LeoXinhaoLee · 2024-09-08T18:05:56Z

Hi, thank you for releasing this great codebase.

I noticed that if a job is interrupted twice (for example, interrupted at step 25, resumed and run until step 45, then resumed again from step 45), the second resume behaves unexpectedly: checkpoint.load() seems to find the latest checkpoint correctly (step 45), but the loaded train_state.step still appears to be the value from the first resume (say 26).

An example logging info for a second-time resume is as follows:

[rank0]:2024-09-08 10:58:15,493 - root - INFO - Loading the checkpoint at step 45.
[rank0]:2024-09-08 10:58:16,211 - root - INFO - Training starts at step 26, with local batch size 8, global batch size 8, sequence length 2048, total steps 100 (warmup 2)

Thank you very much for your help!

LeoXinhaoLee · 2024-09-08T21:08:59Z

This seems to happen because, after the first resume, the train_state held in checkpoint has not been updated to point to the one being incremented during the training loop. The hotfix below seems to solve the problem, but a better solution that eliminates the root cause should probably be considered.

checkpoint_loaded = checkpoint.load()
# Hotfix: re-point the checkpointer at the train_state object that the training loop increments.
checkpoint.states["train_state"] = train_state
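
To make the suspected mechanism concrete, here is a small self-contained toy model. It is not the real codebase: `ToyCheckpointer`, `run`, and `resume_steps` are hypothetical names, and the way `load()` rebinds the dictionary entry is an assumption chosen to reproduce the symptom, not a confirmed description of the actual loader. It only shows how a stale reference would make every checkpoint saved after the first resume carry the old step, and how re-pointing `states["train_state"]` avoids that.

```python
import copy
from dataclasses import dataclass


@dataclass
class TrainState:
    step: int = 0


class ToyCheckpointer:
    """Hypothetical stand-in for a checkpoint manager (not the real API)."""

    def __init__(self, states, storage):
        self.states = states
        self.storage = storage  # simulated disk: checkpoint step -> saved TrainState

    def save(self, step):
        # Snapshot whatever object the checkpointer currently references.
        self.storage[step] = copy.deepcopy(self.states["train_state"])

    def load(self):
        if not self.storage:
            return False
        latest = max(self.storage)
        loaded = copy.deepcopy(self.storage[latest])
        # ASSUMED bug: the caller's object receives the loaded step, but the
        # dictionary entry is rebound to a different object afterwards.
        self.states["train_state"].step = loaded.step
        self.states["train_state"] = loaded
        return True


def run(storage, stop_at, apply_hotfix):
    """Simulate one job launch; return the step the job resumed from."""
    train_state = TrainState()
    checkpoint = ToyCheckpointer({"train_state": train_state}, storage)
    checkpoint.load()
    if apply_hotfix:
        # Same idea as the hotfix above: reference the live object again.
        checkpoint.states["train_state"] = train_state
    resumed_from = train_state.step
    while train_state.step < stop_at:
        train_state.step += 1
        if train_state.step % 5 == 0:
            checkpoint.save(train_state.step)
    return resumed_from


def resume_steps(apply_hotfix):
    storage = {}
    # Three launches: interrupted at step 25, then at step 45, then resumed again.
    return [run(storage, stop, apply_hotfix) for stop in (25, 45, 60)]


print(resume_steps(apply_hotfix=False))
print(resume_steps(apply_hotfix=True))
```

In this toy, the third launch without the hotfix finds the checkpoint saved at step 45 but resumes from step 25, which mirrors the log above (where the next step is reported as 26). With the hotfix it resumes from step 45.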

fegin · 2024-10-30T16:01:51Z

#647 should fix the issue.

wz337 self-assigned this Sep 9, 2024

tianyu-l added the bug label Sep 9, 2024

awgu mentioned this issue Oct 30, 2024

[Config] Make the checkpoint step configurable. #662

Closed

fegin closed this as completed Oct 30, 2024
