[1/2] Deprecate outputs in on_train_epoch_end hooks #7339
Conversation
Hello @ananthsub! Thanks for updating this PR. There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2021-05-05 12:25:47 UTC
Codecov Report
@@           Coverage Diff            @@
##           master   #7339    +/-   ##
=======================================
- Coverage      92%     87%      -4%
=======================================
  Files         200     200
  Lines       12953   12985     +32
=======================================
- Hits        11883   11360    -523
- Misses       1070    1625    +555
LGTM 😃 minor queries
# if the PL module doesn't have the hook then call the accelerator
# used to auto-reduce things for the user with Results obj
elif hasattr(self.trainer.accelerator, hook_name):
Not a huge fan of this. Better to use call_hook and maybe perform the signature analysis somewhere else.
From the comment, call_hook enforces that the accelerator, trainer, and module all take exactly the same arguments for the hook, which might not be the case here. This is the same pattern @kaushikb11 followed in #6120.
I'm not really a fan either, but call_hook calls across 3 distinct interfaces which aren't enforced to be compatible.
Maybe something we can look at for v1.4 is how to simplify/strengthen this? Perhaps the techniques @SkafteNicki used for metric collections could apply here, but that seems beyond the scope of this PR.
One thing I can do is add comments to Trainer.call_hook indicating that this override is applied in the training loop, so any changes to call_hook must also be applied here.
Thanks @ananthsub! I've strayed away from these hooks because of the caching logic and this is clearer
Co-authored-by: Ethan Harris <[email protected]>
Force-pushed from 31314ee to d18455c
Co-authored-by: Jirka Borovec <[email protected]>
For callback implementers, if they need the outputs in the callback, what do you suggest? Cache them through the batch_end callback methods?
Yes, there are at least these options to support this:
the content of
@ananthsub can you please add an example to the docs on how to properly use caching in this case?
On the docs site? In the callback/model hooks?
Wherever you feel is the better place :]
@ananthsub Yes, I also see it that way and I think it's the most straightforward solution. I just wanted to make sure we know what to recommend when someone asks for this, since we are removing a feature that was requested. For everything we deprecate we should have a solution for people who rely on it.
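As a rough illustration of the caching recommendation discussed above, a callback can accumulate per-batch outputs itself via on_train_batch_end and consume them in on_train_epoch_end. The sketch below is a plain Python class with simplified hook signatures for illustration, not the real Lightning Callback API:

```python
class CacheOutputsCallback:
    """Sketch: cache per-batch outputs yourself instead of relying on
    the deprecated `outputs` argument to on_train_epoch_end."""

    def __init__(self):
        self.outputs = []

    def on_train_epoch_start(self):
        # Reset the cache at the start of every epoch.
        self.outputs = []

    def on_train_batch_end(self, batch_output):
        # Store whatever the training step produced for this batch.
        self.outputs.append(batch_output)

    def on_train_epoch_end(self):
        # Post-process the cached outputs, e.g. average a loss.
        losses = [o["loss"] for o in self.outputs]
        return sum(losses) / len(losses)
```

The key point is that the callback, not the trainer, now decides what (if anything) is worth keeping for the whole epoch, which avoids the trainer caching outputs nobody uses.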
What does this PR do?
This addresses part of #6865.
Traditionally, the differentiator between LightningModule.training_epoch_end and the on_train_epoch_end hook was that training_epoch_end received all of that rank's batch outputs for the epoch for post-processing, while on_train_epoch_end took no arguments and didn't dictate whether the trainer should cache those outputs.
We deprecate outputs from on_train_epoch_end because:
This PR checks these conditions for deciding whether to store the per-batch results until the end of the epoch:
- the module overrides on_train_epoch_end and includes outputs in its signature (until v1.5)
The outputs were originally added here: #4369
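The override-plus-signature condition described above can be sketched with inspect. Here is_overridden is a hypothetical helper (Lightning ships a similar utility), and the classes are illustrative stand-ins rather than real LightningModules:

```python
import inspect


def is_overridden(method_name, instance, parent_cls):
    # Hypothetical helper: True if `instance` overrides `method_name`
    # relative to `parent_cls`.
    return getattr(type(instance), method_name) is not getattr(parent_cls, method_name)


def wants_outputs(instance, parent_cls, hook_name="on_train_epoch_end"):
    # Store per-batch results only if the user overrode the hook AND
    # kept the deprecated `outputs` parameter in its signature.
    fn = getattr(instance, hook_name)
    return (is_overridden(hook_name, instance, parent_cls)
            and "outputs" in inspect.signature(fn).parameters)


class Base:
    def on_train_epoch_end(self):
        pass


class LegacyModule(Base):
    def on_train_epoch_end(self, outputs):  # deprecated signature
        pass


class NewModule(Base):
    def on_train_epoch_end(self):  # new signature, no caching needed
        pass
```

Under this check, only modules like LegacyModule trigger the output caching (and a deprecation warning), while NewModule pays no memory cost.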
Before submitting
PR review
Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the Review guidelines. In short, see the following bullet list:
Did you have fun?
Make sure you had fun coding 🙃