-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes resume_from_checkpoint warning to error #7075
Conversation
Codecov Report
@@ Coverage Diff @@
## master #7075 +/- ##
========================================
- Coverage 92% 86% -7%
========================================
Files 196 199 +3
Lines 12569 15504 +2935
========================================
+ Hits 11590 13257 +1667
- Misses 979 2247 +1268 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the fix! Mind adding a simple test for this?
I've fixed the test for this @ananthsub at |
It actually looks like the previous behavior was expected, as you can see there is even a test there and the docstring of the test says it explicitly that no error should occur :) |
@awaelchli I have changed the test to expect a FileNotFoundError. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Co-authored-by: Carlos Mocholí <[email protected]>
We could put an additional warning/info into the docs for more visibility. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed, instead update the warning and/or docs as you see fit to have better visibility 👍
cc: @dsquareindia who created the issue and might have an idea of how this could be clearer
I'm not sure how to better update visibility since I believe the previous message was clear enough if that was the expected behavior. imo if the checkpoint isn't found, I thought it should've been an error instead of a warning since it's a direct conflict with an explicit argument not found because of some error in specifying a location. |
It's mostly a design decision but I do agree with @vballoli in that if there is an incorrect argument being passed, there should be an error. That being said, it's of course not great to break existing workflows and I can't fully judge the measure of the downstream impact this change would have; something that the maintainers can do. If the latter approach is what's preferred, I can't think of anything else other than maybe making the warning like |
Hey @awaelchli @carmocca, I will tend to agree with @vballoli on this one. If a user explicitly provides a
Best, |
I agree too. But unfortunately, #4402 was merged and we overlooked this which means changing this is going to have BC implications for users. So we have 3 options going forward:
What do you prefer? |
I basically agree the discussion of BC-break change. Even before #4402 PR, Pytorch-Lightning have been had auto-resume feature for SLURM (docs). If we totally remove auto-resume feature, it will break long-used (& of course useful) feature. |
yes... 100% agree here. |
i |
sorry, meant to cancel a comment not close it. was going to agree with @carmocca’s option 2 |
@tarepan, can't you rely on the check Thomas proposed?:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving as we are going with option (2)
@carmocca CKPT_PATH = "/~~~~"
ckpt_callback = ModelCheckpoint(dirpath=CKPT_PATH, ...)
nullable_ckpt = CKPT_PATH if get_filesystem(CKPT_PATH).exists(CKPT_PATH) else None
Trainer(callbacks=[ckpt_callback], resume_from_checkpoint=nullable_ckpt) With save & Nullable-load, we can auto-resume. It's cool, thanks @tchaton. |
Hi, the test can be updated and because this is change is backward incompatible, we should not forget the changelog entry. I will send a PR. |
What does this PR do?
Raises a
FileNotFoundError
if an invalid checkpoint path is passed.Fixes #7072
Before submitting
PR review
Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:
Did you have fun?
Make sure you had fun coding 🙃