
Ordering of hooks #8670

Closed
mmgxa opened this issue Aug 2, 2021 · 9 comments
Labels
bug (Something isn't working), help wanted (Open to be worked on), working as intended (Working as intended)

Comments

@mmgxa

mmgxa commented Aug 2, 2021

🐛 Bug

In PL 1.4, the order of hooks has changed.

In PL 1.3.8, it was:

on_train_epoch_start
training_step
training_step
training_step
training_step
training_epoch_end
on_epoch_end
on_validation_epoch_start
validation_step
validation_step
validation_step
validation_step
validation_epoch_end
on_epoch_end

Now, in PL 1.4, it is:

on_train_epoch_start
training_step
training_step
training_step
training_step
on_validation_epoch_start
validation_step
validation_step
validation_step
validation_step
validation_epoch_end
on_epoch_end
training_epoch_end
on_epoch_end

i.e. training_epoch_end now runs after validation_epoch_end instead of right after the last training_step, which doesn't make sense since on_epoch_end is 'just next to it'. Also, note the proximity of the two on_epoch_end calls in PL 1.4.
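The reordering described above can be illustrated with a small framework-agnostic sketch. This is a hypothetical driver that just records hook names in the PL 1.4 order listed in this report; it is not Lightning's actual loop code:

```python
# Hypothetical sketch: simulate the PL 1.4 hook call order reported above.
def simulate_fit_epoch_pl14(num_train_batches, num_val_batches):
    calls = ["on_train_epoch_start"]
    calls += ["training_step"] * num_train_batches
    # Validation now runs BEFORE training_epoch_end:
    calls.append("on_validation_epoch_start")
    calls += ["validation_step"] * num_val_batches
    calls.append("validation_epoch_end")
    calls.append("on_epoch_end")        # validation's on_epoch_end
    calls.append("training_epoch_end")  # fires only after validation
    calls.append("on_epoch_end")        # training's on_epoch_end
    return calls
```

Running this makes the complaint concrete: `training_epoch_end` appears later in the list than `validation_epoch_end`, and the two `on_epoch_end` entries sit back to back at the end.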

To Reproduce

You can use the following Colab link:
https://colab.research.google.com/github/mmg10/pl_bug/blob/main/pl_bug_138.ipynb

https://colab.research.google.com/github/mmg10/pl_bug/blob/main/pl_bug_140.ipynb

Environment

PyTorch Lightning 1.3.8 and 1.4.0 respectively

Significance

In PL 1.3.8, we could get the average of training loss across batches via

def training_epoch_end(self, outputs):
    self.avg_train_loss = torch.stack([x['loss'] for x in outputs]).mean().item()

but now we can't. Note that we can still run the following:

def validation_epoch_end(self, outputs):
     avg_valid_loss = torch.stack([x['loss'] for x in outputs]).mean().item()

since validation_epoch_end is still preceded by the last validation_step.

@mmgxa mmgxa added the bug (Something isn't working) and help wanted (Open to be worked on) labels Aug 2, 2021
@tchaton
Contributor

tchaton commented Aug 2, 2021

@carmocca Any idea here?

@tchaton tchaton added the priority: 0 High priority task label Aug 2, 2021
@Borda
Member

Borda commented Aug 2, 2021

looks similar to #8654

@carmocca carmocca added the working as intended label and removed the priority: 0 (High priority task) label Aug 2, 2021
@carmocca
Contributor

carmocca commented Aug 2, 2021

The order was changed in #7357. See the linked PR for its reasoning.

but now we can't

Can you elaborate on why you can't anymore? Is it because you use the loss keyword during both training and validation?

@mmgxa
Author

mmgxa commented Aug 3, 2021

@carmocca
I mentioned it above. The outputs dictionary contains the losses for all batches. It still does for validation_epoch_end, since that runs right after the last validation_step, but not anymore for training_epoch_end.

@ananthsub
Contributor

@mmgxa - is it preferable for you to track the per-step results and reduce them in on_train_epoch_end or on_validation_epoch_end as you please? What are your thoughts on #8690?

@tchaton
Contributor

tchaton commented Aug 3, 2021

Dear @mmgxa,

I am not sure I follow how you can't get the loss in training_epoch_end.

This seems to work fine:

from pytorch_lightning import Trainer, seed_everything
# BoringModel ships with Lightning's test helpers; the import path has
# varied across versions (e.g. tests.helpers.BoringModel).


def test_epoch_end_hooks(tmpdir):
    seed_everything(42)

    class TestModel(BoringModel):
        def training_step(self, batch, batch_idx):
            loss = super().training_step(batch, batch_idx)
            loss["batch_idx"] = batch_idx
            return loss

        def validation_step(self, batch, batch_idx):
            loss = super().training_step(batch, batch_idx)
            loss["batch_idx"] = -1 * batch_idx
            return loss

        def training_epoch_end(self, outputs) -> None:
            assert sum(x["loss"] for x in outputs).item() == 12.22606086730957
            assert sum(x["batch_idx"] for x in outputs) == sum(range(5))

        def validation_epoch_end(self, outputs) -> None:
            assert sum(x["loss"] for x in outputs).item() == 10.310195922851562
            assert sum(x["batch_idx"] for x in outputs) == -1 * sum(range(3))

    model = TestModel()
    trainer = Trainer(
        default_root_dir=tmpdir,
        max_epochs=1,
        limit_train_batches=5,
        limit_val_batches=3,
        num_sanity_val_steps=0,
    )
    trainer.fit(model)

@mmgxa
Author

mmgxa commented Aug 4, 2021

@tchaton Not sure what this code does. But take a look at the following two notebooks and please try to explain why the results are different:

https://colab.research.google.com/github/mmg10/pl_bug/blob/main/pl_test_138.ipynb
https://colab.research.google.com/github/mmg10/pl_bug/blob/main/pl_test_140.ipynb

(Both train/valid loss should be the same as in the third cell, which is the case for 1.3.8 but not for 1.4.0. In PL 1.4, the train loss is 0 for the first epoch, which is wrong, and in the second epoch it reports the loss for the second step/batch only!)

@mmgxa
Author

mmgxa commented Aug 4, 2021

@ananthsub But on_train_epoch_end doesn't support the outputs parameter, so the loss can't be averaged like in training_epoch_end. One needs to add a variable in __init__ to keep track of it.

My thoughts on #8690? Well, since it doesn't make sense to have training_epoch_end run after the validation steps (just think about the blocks in the first comment), yeah, it's better to remove them once and for all 😏
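The "variable in `__init__`" workaround mentioned above can be sketched in plain Python (hypothetical names, no Lightning dependency): accumulate per-step losses in an attribute, then reduce and reset in the epoch-end hook, which receives no `outputs` argument.

```python
# Hedged sketch of manual loss tracking, since on_train_epoch_end
# receives no `outputs` argument. Class and method names are hypothetical.
class LossTracker:
    def __init__(self):
        self.train_losses = []  # filled by training_step each epoch

    def training_step(self, loss):
        self.train_losses.append(loss)
        return loss

    def on_train_epoch_end(self):
        # Reduce the accumulated losses, then reset for the next epoch.
        avg = sum(self.train_losses) / len(self.train_losses)
        self.train_losses.clear()
        return avg
```

In a real LightningModule the same pattern would append `loss.detach()` in `training_step` and compute the mean in `on_train_epoch_end`.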

@awaelchli
Contributor

@mmgxa So the reason you are seeing different behavior is, as you said, that the hook order changed. You were computing self.avg_train_loss and then referencing it (printing it) in validation_epoch_end. The only solution I can suggest right now is to compute a running average directly in training_step, so the value is already available in the validation hooks.
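That suggestion can be sketched framework-agnostically: maintain the mean incrementally inside the step so no epoch-end reduction is needed. The names here are hypothetical, not a Lightning API.

```python
# Hedged sketch: incremental running mean updated on every training step,
# so the current average is readable at any point (e.g. during validation).
class RunningMean:
    def __init__(self):
        self.mean = 0.0
        self.count = 0

    def update(self, value):
        # Incremental mean: m_k = m_{k-1} + (x_k - m_{k-1}) / k
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean
```

Inside a `training_step`, one would call `update(loss.item())` and read `.mean` from the validation hooks.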

@mmgxa mmgxa closed this as completed Aug 5, 2021
davidgill97 added a commit to davidgill97/LightlySSL that referenced this issue Nov 10, 2023