[WIP] Fix/lr schedulers update calling order #7708
Conversation
Codecov Report

@@            Coverage Diff            @@
##           master    #7708     +/-   ##
=========================================
- Coverage      92%      88%      -5%
=========================================
  Files         199      199
  Lines       12963    12963
=========================================
- Hits        11968    11374     -594
- Misses        995     1589     +594
    callbacks=[train_step_checkpoint_callback, val_epoch_checkpoint_callback]
)
trainer.fit(model)
step_idx = data_length * max_epochs - 1
this could be just
step = trainer.global_step - 1
You are right!
train_step_checkpoint_callback = ModelCheckpoint(dirpath=f"{tmpdir}/every_train_step", every_n_train_steps=1)
val_epoch_checkpoint_callback = ModelCheckpoint(dirpath=f"{tmpdir}/every_val_epoch", every_n_val_epochs=1)
Do we really need to save/load a checkpoint?
Can't you just test the state in on_train_batch_end?
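For illustration only, here is a minimal sketch of what checking the state from on_train_batch_end could look like, using a throwaway callback. The attribute access (trainer.lr_schedulers) matches the Trainer API of that era, but the callback name and the exact expected counter value (one optimizer step per batch, no accumulation, a single epoch) are assumptions, not the PR's actual test:

from pytorch_lightning import Callback


class AssertSchedulerStepped(Callback):
    # hypothetical test helper, not part of this PR

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx, dataloader_idx):
        scheduler = trainer.lr_schedulers[0]["scheduler"]
        # assuming one optimizer step per batch, no gradient accumulation and a
        # single epoch: a scheduler stepped before this hook should already
        # reflect the current batch
        assert scheduler.last_epoch == batch_idx + 1

Passing such a callback to the Trainer would avoid the checkpoint save/load round-trip entirely.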
@@ -494,6 +494,9 @@ def run_training_epoch(self):
            if batch_output.signal == -1:
                break

            # update LR schedulers
            self.update_lr_schedulers('step')
What if you are using a ReduceLROnPlateau scheduler linked to a metric logged after this point? I believe this would fail.
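For context, a hedged sketch of the configuration this comment describes: a step-interval ReduceLROnPlateau whose monitored key is only produced later in the hook sequence. The class name and the metric name my_late_metric are made up for illustration:

import torch
import pytorch_lightning as pl


class StepPlateauExample(pl.LightningModule):
    # only configure_optimizers is sketched; the rest of the module is omitted

    def configure_optimizers(self):
        optimizer = torch.optim.SGD(self.parameters(), lr=0.1)
        return {
            "optimizer": optimizer,
            "lr_scheduler": {
                "scheduler": torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer),
                # stepped every training step, i.e. exactly where the PR moved the update
                "interval": "step",
                # assumed key, logged only by a callback that runs after that point
                "monitor": "my_late_metric",
            },
        }

If update_lr_schedulers('step') runs before my_late_metric has been logged, the monitored key cannot be found yet and the plateau update fails, which is the failure mode the reviewer is pointing at.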
Dear @carmocca,
Exactly, thanks for pointing it out. I will work on it further.
Hi @carmocca, I realize that this could be a bit tricky. Because the pl_module's on_train_batch_end callbacks may log metrics, the LR scheduler must be updated after those callbacks. However, we must also make sure that a Checkpoint's on_train_batch_end hook is called after the scheduler update.
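To make that ordering concrete, here is a rough sketch of the sequence being argued for within a single training batch. The loop object and its method names are invented for illustration and are not Lightning's real internals:

def run_one_training_batch(loop):
    loop.run_training_batch()              # optimizer.step() happens in here
    loop.call_hook("on_train_batch_end")   # LightningModule / ordinary callbacks,
                                           # which may log the monitored metric
    loop.update_lr_schedulers("step")      # needs the metrics logged just above
    loop.call_checkpoint_hooks()           # checkpoint last, so the saved state
                                           # already contains the stepped scheduler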
Hi @carmocca, in this case, might the same issue also exist for the validation part?

if not should_check_val or should_train_only:
    self.update_lr_schedulers('epoch')

if should_train_only:
    self.check_checkpoint_callback(True)

What if the metric is added in on_validation_end callbacks?
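For illustration, a hypothetical module matching this question: an epoch-interval plateau scheduler monitoring a metric that only appears once validation has finished. The class and metric names are made up, and the rest of the module (training_step, validation_step, ...) is omitted:

import torch
import pytorch_lightning as pl


class PlateauOnValMetric(pl.LightningModule):

    def validation_epoch_end(self, outputs):
        # the monitored key only exists once validation has finished; an
        # on_validation_end callback could add a metric even later than this
        self.log("val_loss_epoch", torch.stack(outputs).mean())

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        return {
            "optimizer": optimizer,
            "lr_scheduler": {
                "scheduler": torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer),
                "interval": "epoch",
                "monitor": "val_loss_epoch",
            },
        }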
Long story short: in that snippet you only update the epoch schedulers if running only training, and we had another call to update the epoch schedulers if running validation too. So there is no issue.
However, this was just updated in #7357. It has no conflicts with this PR, but you should rebase on master regardless.
Hi @carmocca, it seems that this issue has been fixed by your updates!
https://colab.research.google.com/drive/1bBkhGiKJoavp1O4Oi4kLWVnxohl5HR6M?usp=sharing
Oh! Sorry I didn't check myself.
Lovely when you unknowingly fix something :D
What does this PR do?
Fixes #7637
Before submitting
PR review
Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the Review guidelines.
Did you have fun?
Make sure you had fun coding 🙃