
[WIP] Fix/lr schedulers update calling order #7708

Conversation

@Lucklyric (Contributor) commented May 25, 2021

What does this PR do?

Fixes #7637

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

codecov bot commented May 25, 2021

Codecov Report

Merging #7708 (bd1076e) into master (7e2f7e9) will decrease coverage by 5%.
The diff coverage is 100%.

@@           Coverage Diff           @@
##           master   #7708    +/-   ##
=======================================
- Coverage      92%     88%    -5%     
=======================================
  Files         199     199            
  Lines       12963   12963            
=======================================
- Hits        11968   11374   -594     
- Misses        995    1589   +594     

@Lucklyric Lucklyric marked this pull request as ready for review May 25, 2021 21:36
callbacks=[train_step_checkpoint_callback, val_epoch_checkpoint_callback]
)
trainer.fit(model)
step_idx = data_length * max_epochs - 1
Contributor:

this could be just

step = trainer.global_step - 1

Contributor Author:

You are right!

Comment on lines +109 to +110
train_step_checkpoint_callback = ModelCheckpoint(dirpath=f"{tmpdir}/every_train_step", every_n_train_steps=1)
val_epoch_checkpoint_callback = ModelCheckpoint(dirpath=f"{tmpdir}/every_val_epoch", every_n_val_epochs=1)
Contributor:

Do we really need to save/load a checkpoint?

Can't you just test the state on_train_batch_end?
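The reviewer's suggestion can be sketched without Lightning at all. The mock below (all names — `FakeTrainer`, `current_lr`, `LRRecordingCallback` — are hypothetical stand-ins, not the Lightning API) shows the idea: a callback records the LR it observes in `on_train_batch_end`, so the scheduler/callback ordering can be asserted directly in memory instead of by saving and reloading a checkpoint.

```python
# Framework-free sketch of the suggested test style: assert state inside an
# on_train_batch_end hook rather than round-tripping a checkpoint to disk.
# All names here are illustrative stand-ins, not the real Lightning API.

class LRRecordingCallback:
    """Records the learning rate seen at the end of every training batch."""
    def __init__(self):
        self.lrs = []

    def on_train_batch_end(self, trainer):
        # If the scheduler steps before batch-end hooks fire, each recorded
        # value is the post-step LR for that batch.
        self.lrs.append(trainer.current_lr)


class FakeTrainer:
    """Toy loop: the scheduler halves the LR, then batch-end hooks run."""
    def __init__(self, initial_lr, gamma, callbacks):
        self.current_lr = initial_lr
        self.gamma = gamma
        self.callbacks = callbacks
        self.global_step = 0

    def fit(self, num_batches):
        for _ in range(num_batches):
            self.global_step += 1
            self.current_lr *= self.gamma      # scheduler update first...
            for cb in self.callbacks:          # ...then on_train_batch_end
                cb.on_train_batch_end(self)


cb = LRRecordingCallback()
trainer = FakeTrainer(initial_lr=0.1, gamma=0.5, callbacks=[cb])
trainer.fit(num_batches=3)
print(cb.lrs)
```

Because the whole sequence runs in one process, the test needs no `dirpath`, no checkpoint files, and no reload step.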

@@ -494,6 +494,9 @@ def run_training_epoch(self):
if batch_output.signal == -1:
break

# update LR schedulers
self.update_lr_schedulers('step')
Contributor:

what if you are using a ReduceLROnPlateau scheduler linked to a metric logged after this point? I believe this would fail
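A minimal, framework-free illustration of this hazard (the `PlateauScheduler` class and metric names below are hypothetical stand-ins for `ReduceLROnPlateau` and Lightning's logged metrics, not real library code): a plateau-style scheduler must read its monitored metric when it steps, so stepping it before the hook that logs that metric fails, while stepping it afterwards works.

```python
# Plain-Python stand-in for the ReduceLROnPlateau ordering hazard:
# the scheduler's step needs a monitored metric that may only be logged
# by a later batch-end hook. All names here are illustrative.

class PlateauScheduler:
    """Stand-in for a plateau scheduler that reads a monitored metric."""
    def __init__(self, monitor="train_loss"):
        self.monitor = monitor

    def step(self, logged_metrics):
        if self.monitor not in logged_metrics:
            raise KeyError(f"metric {self.monitor!r} has not been logged yet")
        return logged_metrics[self.monitor]


logged = {}
scheduler = PlateauScheduler()

# Broken order: scheduler stepped before the hook has logged the metric.
try:
    scheduler.step(logged)
    failed_early_step = False
except KeyError:
    failed_early_step = True

# Correct order: log first (as an on_train_batch_end hook would), then step.
logged["train_loss"] = 0.42
value = scheduler.step(logged)
```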

Contributor Author:

Dear @carmocca
Exactly, thanks for pointing that out.
I will work on this further.

Contributor Author:

Hi @carmocca, I realize that this could be a bit tricky. Because the pl.module's on_train_batch_end callbacks may log metrics, the LR scheduler must be updated after those callbacks run. However, we must also make sure that a checkpoint callback's on_train_batch_end hook is called after the scheduler update.
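The ordering constraint described here can be sketched as a plain-Python call sequence (the hook names below are illustrative, not the exact Lightning internals): metric-logging hooks run first, then the scheduler update that may read those metrics, then the checkpoint hook that must observe the post-step scheduler state.

```python
# Sketch of the required per-batch ordering: log metrics -> step schedulers
# -> run checkpoint hook. Names are illustrative, not Lightning internals.

calls = []

def log_metrics_hook():
    # pl.module's on_train_batch_end logging must come first, so any
    # metric-driven scheduler sees fresh values.
    calls.append("log_metrics")

def update_lr_schedulers():
    # Scheduler update runs after logging, before checkpointing.
    calls.append("update_lr_schedulers")

def checkpoint_hook():
    # Checkpoint must capture the scheduler's post-step state.
    calls.append("checkpoint")

def run_one_batch():
    log_metrics_hook()
    update_lr_schedulers()
    checkpoint_hook()

run_one_batch()
print(calls)
```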

Contributor Author:

Hi @carmocca, in this case, does the same issue also exist for the validation part?

        if not should_check_val or should_train_only:
            self.update_lr_schedulers('epoch')

        if should_train_only:
            self.check_checkpoint_callback(True)

What if the metric is logged in on_validation_end callbacks?

Contributor:

Long story short: in that snippet, the epoch schedulers are only updated when running training alone, and there is another call that updates the epoch schedulers when validation runs too.
So there is no issue.

However, this was just updated in #7357. It has no conflicts with this PR, but you should rebase onto master regardless.

Contributor Author:

Hi @carmocca, it seems that this issue has been fixed by your updates!
https://colab.research.google.com/drive/1bBkhGiKJoavp1O4Oi4kLWVnxohl5HR6M?usp=sharing

Contributor:

Oh! Sorry I didn't check myself.

Lovely when you unknowingly fix something :D

@Lucklyric Lucklyric changed the title Fix/lr schedulers update calling order [WIP] Fix/lr schedulers update calling order May 26, 2021
@awaelchli awaelchli added the bug Something isn't working label May 27, 2021
@awaelchli awaelchli added this to the v1.3.x milestone May 27, 2021
@Lucklyric Lucklyric closed this May 28, 2021
Labels: bug (Something isn't working)
Projects: none yet
Linked issue (may be closed by merging): LR scheduler steps after saving checkpoint with iteration-based checkpointing
3 participants