Split `CheckpointConnector.restore_training_state` into logical parts [1 / 2] #7901
Conversation
Codecov Report
@@            Coverage Diff            @@
##           master   #7901    +/-   ##
========================================
+ Coverage      88%     92%      +4%
========================================
  Files         204     200       -4
  Lines       13667   12837     -830
========================================
- Hits        12047   11819     -228
+ Misses       1620    1018     -602
""" Restores all callbacks from the pre-loaded checkpoint. """ | ||
if not self._loaded_checkpoint: | ||
return |
A few questions:
- is there a risk of these restoration functions being called outside this context? Should the start and end of restoring from a checkpoint be wrapped in a dedicated context manager?
- in splitting these out, should we be prescriptive about the order in which they are loaded?
Hey, good question.
If you look at the "end result" in #7652 (open to discussion), you will see that in the Trainer file, resume_start() and resume_end() are actually called in very different places, so I can't make it into a context manager.
Yes, I think it's best to document the order, since it may be important. In the future we will want to make what gets restored configurable, so some of these functions will be called on demand and some won't be called at all.
actually, maybe a context manager could still work. I will investigate it in #7652
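For reference, such a wrapper could look roughly like this (a minimal sketch only; the function name is made up, only resume_start / resume_end come from this PR, and whether it fits where the Trainer actually calls them is exactly what #7652 needs to answer):

```python
from contextlib import contextmanager


@contextmanager
def restored_checkpoint(connector):
    # Hypothetical wrapper: load the checkpoint before restoration starts ...
    connector.resume_start()
    try:
        yield connector
    finally:
        # ... and drop the loaded checkpoint once restoration is done.
        connector.resume_end()
```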
I think Ananth's suggestion is good.
Also, could any other class want to call the start and end methods?
No, I think we would only want to call them for unit testing, or from the context manager if that works out.
So I see what you are saying; yes, I will put the underscores everywhere.
Okay, a context manager could kind of work, but I see an issue. Can we move the conversation to #7652 so I can point directly to the code in trainer.py?
if any([key in self._loaded_checkpoint for key in DEPRECATED_CHECKPOINT_KEYS]):
    raise ValueError(
        "The checkpoint you're attempting to load follows an"
        " outdated schema. You can upgrade to the current schema by running"
        " `python -m pytorch_lightning.utilities.upgrade_checkpoint --file model.ckpt`"
        " where `model.ckpt` is your checkpoint file."
    )
Should this validation be done in resume_start?
Good question, maybe we could.
One thought though: in the future we will have a way to configure what to load, so if these functions get called individually, we may want to keep the validation together with the particular objects that are being restored.
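For example, a minimal sketch of per-object validation (the helper name is made up for illustration, and the error message is adapted from the one in this diff):

```python
def _validate_optimizer_state(checkpoint: dict) -> None:
    # Hypothetical helper: check only the key that the optimizer restore needs,
    # so restoring e.g. callbacks alone never trips over a missing optimizer key.
    if "optimizer_states" not in checkpoint:
        raise KeyError(
            "Trying to restore optimizer state but checkpoint contains only the model."
            " This is probably due to `ModelCheckpoint.save_weights_only` being set to `True`."
        )
```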
""" Restores all callbacks from the pre-loaded checkpoint. """ | ||
if not self._loaded_checkpoint: | ||
return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think Ananth's suggestion is good.
Also, could any other class want to call the start
and end
methods?
if "optimizer_states" not in self._loaded_checkpoint or "lr_schedulers" not in self._loaded_checkpoint: | ||
raise KeyError( | ||
"Trying to restore training state but checkpoint contains only the model." | ||
" This is probably due to `ModelCheckpoint.save_weights_only` being set to `True`." | ||
) |
Does this chain of PRs plan to tackle the issue of restoring part of the checkpoint?
Not the aim directly, but it will definitely help, and we can continue with it right after these PRs. There will be nothing standing in the way as far as I can tell :)
Love it! Really nice cleanup.
LGTM, small comment
What does this PR do?
In #7900, `CheckpointConnector.restore_training_state` gets split into multiple pieces. This PR introduces the new functions, but they are unused. #7900 will then refactor the `CheckpointConnector.restore_training_state` method to use them. This is mainly to reduce a hard-to-read diff for reviews!
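For a rough idea of the shape of the split, something like the following (a simplified sketch; the method names and bodies here are illustrative only, the actual functions are in the diff):

```python
class CheckpointConnector:
    def restore_callbacks(self) -> None:
        """ Restores all callbacks from the pre-loaded checkpoint. """
        if not self._loaded_checkpoint:
            return
        ...  # delegate to the callbacks' on_load_checkpoint hooks

    def restore_optimizers(self) -> None:
        """ Restores the optimizer states from the pre-loaded checkpoint. """
        if not self._loaded_checkpoint:
            return
        ...  # load each optimizer's state_dict from the checkpoint

    def restore_lr_schedulers(self) -> None:
        """ Restores the learning-rate scheduler states from the pre-loaded checkpoint. """
        if not self._loaded_checkpoint:
            return
        ...  # load each scheduler's state_dict from the checkpoint
```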
Before submitting
PR review
Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the Review guidelines.
Did you have fun?
Make sure you had fun coding 🙃