Add checkpoint load step #716
Thanks! It'd be great to also update docs/checkpoint.md
torchtitan/config_manager.py
Outdated
default=0,
help="Load the checkpoint at the specified step. If 0, load the latest checkpoint.",
)
Should the default be -1? See https://github.com/pytorch/torchtitan/blob/main/torchtitan/checkpoint.py#L452
You can just run a local script, checkpoint every 10 steps, and then load again from the 20th step. The logging should show that training starts from the 20th step. If …
@fegin I think the original ask in #662 was to resume from a particular step, not necessarily the latest one.
What I meant is that if …
Oh, I was assuming Mark's point was how to write a lightweight unit test for it. I think it is not blocking, as unit tests for checkpointing in general are missing. I think it's worth updating the …
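For reference, the behavior discussed in this thread (a sentinel default of -1 meaning "load the latest checkpoint", otherwise resume from the requested step) could be sketched roughly as below. This is only an illustration and not the code in this PR: the helper name `resolve_load_step` and the `step-<N>` folder naming are assumptions made for the example.

```python
import re
from pathlib import Path

# Assumption for this sketch: each checkpoint lives in a folder named "step-<N>".
STEP_DIR_PATTERN = re.compile(r"^step-(\d+)$")


def resolve_load_step(checkpoint_dir, load_step=-1):
    """Return the step to load, or None if no checkpoint exists.

    load_step == -1 is treated as "load the latest checkpoint";
    any other value must match an existing checkpoint folder.
    """
    steps = []
    for path in Path(checkpoint_dir).iterdir():
        match = STEP_DIR_PATTERN.match(path.name)
        if path.is_dir() and match:
            steps.append(int(match.group(1)))

    if not steps:
        return None  # nothing saved yet, so training starts from scratch
    if load_step == -1:
        return max(steps)  # sentinel: pick the most recent checkpoint
    if load_step not in steps:
        raise ValueError(
            f"No checkpoint for step {load_step}; available steps: {sorted(steps)}"
        )
    return load_step
```

With checkpoints saved every 10 steps (folders for steps 10, 20 and 30), calling `resolve_load_step(folder, 20)` would select step 20, while the default would select step 30. Using -1 rather than 0 as the sentinel keeps step 0 available as a real checkpoint step.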
thanks!
Before merge, can you also help update the how-to in docs/checkpoint.md with one more bullet point?
Added it. I can do another PR to better document checkpoint.md, since right now it just lists configs.
Fixes #662
Followed @fegin's advice to test this, and things are indeed working: https://gist.github.com/msaroufim/2925b3f17b631bf370a49f185b6e169d