Ordering of hooks #8670
Comments
@carmocca Any idea there?
Looks similar to #8654.
The order was changed in #7357. See the linked PR for its reasoning.
Can you elaborate on why you can't anymore? Is it because you use the …
@carmocca
Dear @mmgxa, I am not sure I follow how you can't get the loss in training_epoch_end. This seems to work fine:

from pytorch_lightning import Trainer, seed_everything
from tests.helpers import BoringModel  # import paths assumed from the Lightning test suite of that time


def test_epoch_end_hooks(tmpdir):
    seed_everything(42)

    class TestModel(BoringModel):
        def training_step(self, batch, batch_idx):
            loss = super().training_step(batch, batch_idx)
            loss["batch_idx"] = batch_idx
            return loss

        def validation_step(self, batch, batch_idx):
            loss = super().training_step(batch, batch_idx)
            loss["batch_idx"] = -1 * batch_idx
            return loss

        def training_epoch_end(self, outputs) -> None:
            # all 5 training batches are present in `outputs`
            assert sum(x["loss"] for x in outputs).item() == 12.22606086730957
            assert sum(x["batch_idx"] for x in outputs) == sum(range(5))

        def validation_epoch_end(self, outputs) -> None:
            # all 3 validation batches are present in `outputs`
            assert sum(x["loss"] for x in outputs).item() == 10.310195922851562
            assert sum(x["batch_idx"] for x in outputs) == -1 * sum(range(3))

    model = TestModel()
    trainer = Trainer(
        default_root_dir=tmpdir,
        max_epochs=1,
        limit_train_batches=5,
        limit_val_batches=3,
        num_sanity_val_steps=0,
    )
    trainer.fit(model)
@tchaton Not sure what this code does. But take a look at the following two notebooks and please explain why the results are different: https://colab.research.google.com/github/mmg10/pl_bug/blob/main/pl_test_138.ipynb (Both train/valid loss should be the same as in the third cell, which is the case for 1.3.8 but not for 1.4.0. In PL 1.4, the train loss is 0 for the first epoch, which is wrong, and in the second epoch it reports the loss for the second step/batch only!)
@ananthsub My thoughts on #8690? Well, since it doesn't make sense to have the …
@mmgxa So the reason you are seeing different behavior is, as you said, that the hook order changed. You were computing a …
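The rest of that comment is not preserved here. As a general illustration (not necessarily the exact suggestion being made above), here is a minimal sketch of logging the epoch-level training loss with self.log(..., on_epoch=True), which aggregates across the epoch regardless of when the epoch-end hooks run; the module name and layer sizes are made up for the example:

import torch
import pytorch_lightning as pl


class LogEpochLossModel(pl.LightningModule):
    # Hypothetical minimal module, used only to illustrate self.log aggregation.
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 1)

    def training_step(self, batch, batch_idx):
        loss = self.layer(batch).sum()
        # on_epoch=True accumulates the value and logs its mean at the end of
        # the epoch, so the result does not depend on hook ordering.
        self.log("train_loss", loss, on_step=False, on_epoch=True)
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)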
🐛 Bug
In PL 1.4, the order of hooks has changed.
In PL 1.3.8, training_epoch_end ran right after the last training_step; now, in PL 1.4, it runs after validation_epoch_end, which doesn't make sense since on_epoch_end is 'just next to it'. Also, note the proximity of the two on_epoch_end calls in PL 1.4.
To Reproduce
You can use the following Colab links:
https://colab.research.google.com/github/mmg10/pl_bug/blob/main/pl_bug_138.ipynb
https://colab.research.google.com/github/mmg10/pl_bug/blob/main/pl_bug_140.ipynb
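As a local alternative to the notebooks, a minimal sketch (assumed, not code taken from the notebooks) that prints the order in which the step and epoch-end hooks fire; running it under PL 1.3.8 and under 1.4.0 should show the difference described above:

import torch
from torch.utils.data import DataLoader, Dataset
import pytorch_lightning as pl


class RandomDataset(Dataset):
    # Tiny synthetic dataset, just enough to drive the hooks.
    def __len__(self):
        return 8

    def __getitem__(self, idx):
        return torch.randn(4)


class HookOrderModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 1)

    def training_step(self, batch, batch_idx):
        print(f"training_step {batch_idx}")
        return self.layer(batch).sum()

    def validation_step(self, batch, batch_idx):
        print(f"validation_step {batch_idx}")

    def training_epoch_end(self, outputs):
        print("training_epoch_end")

    def validation_epoch_end(self, outputs):
        print("validation_epoch_end")

    def on_epoch_end(self):
        print("on_epoch_end")

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


if __name__ == "__main__":
    loader = DataLoader(RandomDataset(), batch_size=2)
    trainer = pl.Trainer(
        max_epochs=1,
        limit_train_batches=2,
        limit_val_batches=2,
        num_sanity_val_steps=0,
        progress_bar_refresh_rate=0,
    )
    trainer.fit(HookOrderModel(), loader, loader)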
Environment
PyTorch Lightning 1.3.8 and 1.4.0 respectively
Significance
In PL 1.3.8, we could get the average of the training loss across batches in training_epoch_end, but now we can't. Note that we can still do the same in validation_epoch_end, since it is still preceded by the last validation_step.
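The original snippets referred to above are not preserved in this copy. A hedged reconstruction of the kind of epoch-end averaging being discussed (names and layer sizes assumed, not taken from the issue):

import torch
import pytorch_lightning as pl


class EpochAverageModel(pl.LightningModule):
    # Hypothetical model illustrating manual averaging of per-batch losses
    # in the *_epoch_end hooks, i.e. the pattern discussed above.
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 1)

    def training_step(self, batch, batch_idx):
        loss = self.layer(batch).sum()
        return {"loss": loss}

    def validation_step(self, batch, batch_idx):
        loss = self.layer(batch).sum()
        return {"val_loss": loss}

    def training_epoch_end(self, outputs):
        # `outputs` holds one dict per training batch of the epoch.
        avg_loss = torch.stack([x["loss"] for x in outputs]).mean()
        self.log("train_loss_epoch", avg_loss)

    def validation_epoch_end(self, outputs):
        avg_loss = torch.stack([x["val_loss"] for x in outputs]).mean()
        self.log("val_loss_epoch", avg_loss)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

Per the test posted above, the `outputs` list in training_epoch_end still contains every training batch of the epoch in both 1.3.8 and 1.4.0; what changed is when training_epoch_end runs relative to the validation hooks.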