Always run validation inside the training loop epoch #7357
Conversation
Hello @carmocca! Thanks for updating this PR. There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2021-05-26 00:55:54 UTC
Codecov Report

@@            Coverage Diff            @@
##           master   #7357     +/-   ##
=========================================
- Coverage      93%      92%      -0%
=========================================
  Files         199      199
  Lines       12971    12960      -11
=========================================
- Hits        12001    11952      -49
- Misses        970     1008      +38
thanks for the cleanup @carmocca!
I like it a lot!
Two questions about the cons you mentioned around the loader logic:
- Having the data loader in memory shouldn't be a problem as before you also had both datasets in memory (the memory footprint of the loader itself is very low)
- Assuming you're talking about the worker processes: Shouldn't the train loader's worker processes vanish (of course only if set as non-persistent) as soon as they have finished iterating over the dataset (which would still be the case from my understanding)?
great stuff
I haven't checked if this is happening, but what if your batches are huge and you keep the last train batch on the GPU while you run validation? This is what I was referring to.
Recap for readers from the future:
In previous versions, depending on your training flags (namely val_check_interval), validation would either (1) start after the training batches have run but before the training epoch finishes, or (2) run after the training epoch finishes.
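In simplified pseudocode (a minimal sketch of the two flows; the helper names below are placeholders, not Lightning internals):

```python
# (1) validation triggered by `val_check_interval`, inside the training epoch
for i, batch in enumerate(train_dataloader):
    training_step(batch)
    if should_check_val(i):   # e.g. every `val_check_interval` batches
        run_validation()      # runs before the epoch-end hooks
on_train_epoch_end()

# (2) classic flow: validation after the training epoch finishes
for batch in train_dataloader:
    training_step(batch)
on_train_epoch_end()
run_validation()              # runs after the epoch-end hooks
```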
This release changes the flow so that validation always runs inside the training epoch, before the epoch-end hooks, regardless of the trainer configuration. This has the advantage of facilitating the writing of callbacks, since it is now easier to support code that can run validation at the end of the training epoch or in the middle of it.
The breaking change comes from the fact that after this change, validation code/callbacks that relied on metrics logged or aggregated on epoch end will not work. This is to be expected as they will not be available when validation runs.
For example, if your early stopping callback was tracking a metric logged in training_epoch_end, you will need to set EarlyStopping(check_on_train_epoch_end=True).
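A minimal configuration along those lines (the monitored metric name "train_loss_epoch" is a placeholder and assumes you log that value during training with on_epoch=True):

```python
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import EarlyStopping

# Evaluate the early-stopping condition at the end of the training epoch,
# where the epoch-aggregated training metric is available, instead of at
# validation time (where it no longer is).
early_stop = EarlyStopping(
    monitor="train_loss_epoch",  # placeholder metric name
    mode="min",
    check_on_train_epoch_end=True,
)

trainer = Trainer(max_epochs=10, callbacks=[early_stop])
```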
What does this PR do?
The current training epoch structure follows this pattern (shown with max_epochs=1, limit_train_batches=1, limit_val_batches=1): when validation is run at the end of the epoch (the classic loop structure), the validation hooks fire after on_train_epoch_end (though still before on_train_end), so validation is considered to be out of the training scope. This is inconsistent with the pattern used when validation is run in the middle of the epoch, where the validation hooks fire inside the training scope, as sketched below.
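A minimal sketch of the two orderings, assuming the standard hook names (the helpers are placeholders, not Lightning internals):

```python
# End-of-epoch validation (classic structure): validation sits outside the training scope
on_train_start()
on_train_epoch_start()
for batch in train_dataloader:
    training_step(batch)
on_train_epoch_end()
run_validation()      # validation hooks fire here, after the epoch-end hooks
on_train_end()

# Mid-epoch validation (`val_check_interval`): validation sits inside the training scope
on_train_start()
on_train_epoch_start()
for i, batch in enumerate(train_dataloader):
    training_step(batch)
    if should_check_val(i):
        run_validation()  # validation hooks fire here, before the epoch-end hooks
on_train_epoch_end()
on_train_end()
```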
This PR changes the code so the latter pattern is always used.
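With the change, end-of-epoch validation also runs inside the training scope, before the epoch-end hooks (same hedged notation as above):

```python
# Unified flow after this PR: validation always runs inside the training epoch
on_train_start()
on_train_epoch_start()
for i, batch in enumerate(train_dataloader):
    training_step(batch)
    if should_check_val(i):  # mid-epoch or last-batch check
        run_validation()     # always before on_train_epoch_end
on_train_epoch_end()
on_train_end()
```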
Pros:
- A single, consistent flow regardless of the trainer configuration, which makes it easier to write callbacks that support validation both at the end of the training epoch and in the middle of it.

Cons:
- Validation code and callbacks can no longer rely on metrics logged in on_train_epoch_end or inside training_step via self.log(..., on_epoch=True), as those are not yet available when validation runs. This is a breaking change. EarlyStopping users can set check_on_train_epoch_end=True to avoid the error.
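For reference, a training step like the following (the module and shapes are illustrative) logs an epoch-aggregated metric that end-of-epoch validation could previously see but now cannot:

```python
import torch
import torch.nn.functional as F
import pytorch_lightning as pl


class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = F.mse_loss(self.layer(x), y)
        # With on_epoch=True, the epoch-aggregated value ("train_loss_epoch") is
        # only reduced once the training epoch ends, i.e. after validation has run.
        self.log("train_loss", loss, on_step=True, on_epoch=True)
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)
```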