
[WIP] Fix/lr schedulers update calling order #7708

Conversation

@Lucklyric (Contributor) commented May 25, 2021

What does this PR do?

Fixes #7637

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

codecov bot commented May 25, 2021

Codecov Report

Merging #7708 (bd1076e) into master (7e2f7e9) will decrease coverage by 5%.
The diff coverage is 100%.

@@           Coverage Diff           @@
##           master   #7708    +/-   ##
=======================================
- Coverage      92%     88%    -5%     
=======================================
  Files         199     199            
  Lines       12963   12963            
=======================================
- Hits        11968   11374   -594     
- Misses        995    1589   +594     

@Lucklyric Lucklyric marked this pull request as ready for review May 25, 2021 21:36
callbacks=[train_step_checkpoint_callback, val_epoch_checkpoint_callback]
)
trainer.fit(model)
step_idx = data_length * max_epochs - 1
Contributor:

this could be just

step = trainer.global_step - 1

Contributor Author:

You are right!

Comment on lines +109 to +110
train_step_checkpoint_callback = ModelCheckpoint(dirpath=f"{tmpdir}/every_train_step", every_n_train_steps=1)
val_epoch_checkpoint_callback = ModelCheckpoint(dirpath=f"{tmpdir}/every_val_epoch", every_n_val_epochs=1)
Contributor:

Do we really need to save/load a checkpoint?

Can't you just test the state on_train_batch_end?
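The reviewer's suggestion can be sketched without Lightning at all. The mock below (all names — `FakeTrainer`, `current_lr`, `LRRecordingCallback` — are hypothetical stand-ins, not the Lightning API) shows the idea: a callback records the LR it observes in `on_train_batch_end`, so the scheduler/callback ordering can be asserted directly in memory instead of by saving and reloading a checkpoint.

```python
# Framework-free sketch of the suggested test style: assert state inside an
# on_train_batch_end hook rather than round-tripping a checkpoint to disk.
# All names here are illustrative stand-ins, not the real Lightning API.

class LRRecordingCallback:
    """Records the learning rate seen at the end of every training batch."""
    def __init__(self):
        self.lrs = []

    def on_train_batch_end(self, trainer):
        # If the scheduler steps before batch-end hooks fire, each recorded
        # value is the post-step LR for that batch.
        self.lrs.append(trainer.current_lr)


class FakeTrainer:
    """Toy loop: the scheduler halves the LR, then batch-end hooks run."""
    def __init__(self, initial_lr, gamma, callbacks):
        self.current_lr = initial_lr
        self.gamma = gamma
        self.callbacks = callbacks
        self.global_step = 0

    def fit(self, num_batches):
        for _ in range(num_batches):
            self.global_step += 1
            self.current_lr *= self.gamma      # scheduler update first...
            for cb in self.callbacks:          # ...then on_train_batch_end
                cb.on_train_batch_end(self)


cb = LRRecordingCallback()
trainer = FakeTrainer(initial_lr=0.1, gamma=0.5, callbacks=[cb])
trainer.fit(num_batches=3)
print(cb.lrs)
```

Because the whole sequence runs in one process, the test needs no `dirpath`, no checkpoint files, and no reload step.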

@@ -494,6 +494,9 @@ def run_training_epoch(self):
if batch_output.signal == -1:
break

# update LR schedulers
self.update_lr_schedulers('step')
Contributor:

what if you are using a ReduceLROnPlateau scheduler linked to a metric logged after this point? I believe this would fail
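A minimal, framework-free illustration of this hazard (the `PlateauScheduler` class and metric names below are hypothetical stand-ins for `ReduceLROnPlateau` and Lightning's logged metrics, not real library code): a plateau-style scheduler must read its monitored metric when it steps, so stepping it before the hook that logs that metric fails, while stepping it afterwards works.

```python
# Plain-Python stand-in for the ReduceLROnPlateau ordering hazard:
# the scheduler's step needs a monitored metric that may only be logged
# by a later batch-end hook. All names here are illustrative.

class PlateauScheduler:
    """Stand-in for a plateau scheduler that reads a monitored metric."""
    def __init__(self, monitor="train_loss"):
        self.monitor = monitor

    def step(self, logged_metrics):
        if self.monitor not in logged_metrics:
            raise KeyError(f"metric {self.monitor!r} has not been logged yet")
        return logged_metrics[self.monitor]


logged = {}
scheduler = PlateauScheduler()

# Broken order: scheduler stepped before the hook has logged the metric.
try:
    scheduler.step(logged)
    failed_early_step = False
except KeyError:
    failed_early_step = True

# Correct order: log first (as an on_train_batch_end hook would), then step.
logged["train_loss"] = 0.42
value = scheduler.step(logged)
```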

Contributor Author:

Dear @carmocca
Exactly, thanks for pointing that out.
I will work on this further.

Contributor Author:

Hi @carmocca, I realize that this could be a bit tricky. Because the pl.module's on_train_batch_end callbacks may log metrics, the LR scheduler must be updated after those callbacks run. However, we must also make sure that a checkpoint callback's on_train_batch_end hook is called after the scheduler update.
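The ordering constraint described here can be sketched as a plain-Python call sequence (the hook names below are illustrative, not the exact Lightning internals): metric-logging hooks run first, then the scheduler update that may read those metrics, then the checkpoint hook that must observe the post-step scheduler state.

```python
# Sketch of the required per-batch ordering: log metrics -> step schedulers
# -> run checkpoint hook. Names are illustrative, not Lightning internals.

calls = []

def log_metrics_hook():
    # pl.module's on_train_batch_end logging must come first, so any
    # metric-driven scheduler sees fresh values.
    calls.append("log_metrics")

def update_lr_schedulers():
    # Scheduler update runs after logging, before checkpointing.
    calls.append("update_lr_schedulers")

def checkpoint_hook():
    # Checkpoint must capture the scheduler's post-step state.
    calls.append("checkpoint")

def run_one_batch():
    log_metrics_hook()
    update_lr_schedulers()
    checkpoint_hook()

run_one_batch()
print(calls)
```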

Contributor Author:

Hi @carmocca, in this case, does the same issue also exist for the validation part?

        if not should_check_val or should_train_only:
            self.update_lr_schedulers('epoch')

        if should_train_only:
            self.check_checkpoint_callback(True)

What if the metric is logged in on_validation_end callbacks?

Contributor:

Long story short: in that snippet, the epoch schedulers are only updated when running training alone, and there is another call that updates the epoch schedulers when validation runs too.
So there is no issue.

However, this was just updated in #7357. It has no conflicts with this PR, but you should rebase onto master regardless.

Contributor Author:

Hi @carmocca, it seems that this issue has been fixed by your updates!
https://colab.research.google.com/drive/1bBkhGiKJoavp1O4Oi4kLWVnxohl5HR6M?usp=sharing

Contributor:

Oh! Sorry I didn't check myself.

Lovely when you unknowingly fix something :D

@Lucklyric Lucklyric changed the title Fix/lr schedulers update calling order [WIP] Fix/lr schedulers update calling order May 26, 2021
@awaelchli awaelchli added the bug Something isn't working label May 27, 2021
@awaelchli awaelchli added this to the v1.3.x milestone May 27, 2021
@Lucklyric Lucklyric closed this May 28, 2021
Labels: bug (Something isn't working)
Projects: none yet
Linked issue (may be closed by merging): LR scheduler steps after saving checkpoint with iteration-based checkpointing
3 participants