Wrong train_state.step when resuming from checkpoint for the second time #571

Closed
LeoXinhaoLee opened this issue Sep 8, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@LeoXinhaoLee

Hi, thank you for releasing this great codebase.

I noticed that if a job is interrupted twice, the second resume behaves unexpectedly. For example, say the job is first interrupted at step 25, resumes and runs until step 45, and is then interrupted and resumed again from step 45. On the second resume, checkpoint.load() seems to find the latest checkpoint correctly (step 45), but the loaded train_state.step still seems to be the value from the first resume (say 26).

An example logging info for a second-time resume is as follows:

[rank0]:2024-09-08 10:58:15,493 - root - INFO - Loading the checkpoint at step 45.  [90/1925]
[rank0]:2024-09-08 10:58:16,211 - root - INFO - Training starts at step 26, with local batch size 8, global batch size 8, sequence length 2048, total steps 100 (warmup 2)

Thank you very much for your help!

@LeoXinhaoLee
Author

This seems to happen because, after the first resume, the train_state held by checkpoint is no longer updated to point to the one being incremented during the train loop. The hotfix below seems to solve the problem, but a better solution that eliminates the root cause of this issue should probably be considered.

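checkpoint_loaded = checkpoint.load()
# Hotfix: re-point the checkpoint's entry at the train_state object that the train loop increments.
checkpoint.states["train_state"] = train_state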
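
To make the suspected mechanism concrete, here is a toy reproduction in plain Python. It is only a sketch under assumptions: ToyCheckpointManager, run_job, and simulate are hypothetical names, the rebinding inside load() is assumed in order to reproduce the symptom (it is not confirmed to be what the real checkpoint code does), and the logged start step is assumed to be train_state.step + 1.

```python
from dataclasses import dataclass


@dataclass
class TrainState:
    step: int = 0


class ToyCheckpointManager:
    """Hypothetical stand-in, not the project's real checkpoint class."""

    def __init__(self, states, disk):
        self.states = states  # name -> live object to serialize
        self.disk = disk      # checkpoint folder step -> saved payload

    def save(self, folder_step):
        # Serializes whatever object self.states currently points to.
        self.disk[folder_step] = {"step": self.states["train_state"].step}

    def load(self):
        if not self.disk:
            return False
        latest = max(self.disk)  # "Loading the checkpoint at step <latest>"
        loaded = TrainState(**self.disk[latest])
        # The caller's object receives the loaded value...
        self.states["train_state"].step = loaded.step
        # ...but (ASSUMED bug) the manager ends up holding a different object.
        self.states["train_state"] = loaded
        return True


def run_job(disk, stop_at, hotfix):
    """Run one job until stop_at, then save; return the logged start step."""
    train_state = TrainState()
    checkpoint = ToyCheckpointManager({"train_state": train_state}, disk)
    checkpoint.load()
    if hotfix:
        # The hotfix from this comment: re-point at the loop's object.
        checkpoint.states["train_state"] = train_state
    start_step = train_state.step + 1  # assumed logging convention
    while train_state.step < stop_at:
        train_state.step += 1  # only the loop's object advances
    checkpoint.save(train_state.step)
    return start_step


def simulate(hotfix):
    disk = {}
    # Fresh run to 25, first resume to 45, second resume to 100.
    return [run_job(disk, stop, hotfix) for stop in (25, 45, 100)]


if __name__ == "__main__":
    print("without hotfix:", simulate(hotfix=False))
    print("with hotfix:   ", simulate(hotfix=True))
```

In this toy, the checkpoint written at step 45 without the hotfix stores the stale step 25, so the second resume finds the step-45 folder but starts at step 26 again. With the hotfix it starts at step 46.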

@wz337 wz337 self-assigned this Sep 9, 2024
@tianyu-l tianyu-l added the bug Something isn't working label Sep 9, 2024
@fegin
Contributor

fegin commented Oct 30, 2024

#647 should fix the issue.

@fegin fegin closed this as completed Oct 30, 2024