
Always run validation inside the training loop epoch #7357

Merged (40 commits) on May 26, 2021

Conversation

@carmocca (Contributor) commented May 4, 2021

Recap for readers from the future:

In previous versions, depending on your training flags (namely val_check_interval), validation would either (1) run after the training epoch finishes, or (2) run after some training batches but before the training epoch finishes. Pseudocode:

(1)

on_train_epoch_start()
for batch in train_dataloader():
    training_step()
    ...
on_train_epoch_end()

if should_run_validation:
    on_validation_epoch_start()
    for batch in val_dataloader():
        validation_step()
    on_validation_epoch_end()

(2)

on_train_epoch_start()
for batch in train_dataloader():
    training_step()
    ...
    if should_run_validation:
        on_validation_epoch_start()
        for batch in val_dataloader():
            validation_step()
        on_validation_epoch_end()
on_train_epoch_end()

This release changes the flow so that (2) is always used, regardless of the trainer configuration. This makes callbacks easier to write, since the same hook ordering now applies whether validation runs at the end of the training epoch or in the middle of it.
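
As a rough illustration (not from the PR itself), the flag that selects between the two flows is val_check_interval. A minimal sketch, assuming a LightningModule named model and a datamodule are defined elsewhere:

from pytorch_lightning import Trainer

# val_check_interval=1.0 (the default): validate once per epoch, after all
# training batches have run. Before this change, this produced flow (1).
trainer = Trainer(max_epochs=3, val_check_interval=1.0)

# val_check_interval=0.25: validate every 25% of the training batches,
# i.e. in the middle of the epoch. This has always followed flow (2).
trainer = Trainer(max_epochs=3, val_check_interval=0.25)

# After this change, both configurations run validation inside the training
# epoch, i.e. before on_train_epoch_end fires.
# trainer.fit(model, datamodule=datamodule)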

The breaking change: validation code and callbacks that relied on metrics logged or aggregated at the end of the training epoch will no longer work, because those metrics are not yet available when validation runs.

For example, if your early stopping callback was tracking a metric logged on training_epoch_end, you will need to set EarlyStopping(check_on_train_epoch_end=True).
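
A minimal sketch of that workaround, assuming a hypothetical metric name train_loss_epoch that the LightningModule logs during the training epoch (for example in training_epoch_end):

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import EarlyStopping

early_stop = EarlyStopping(
    monitor="train_loss_epoch",  # hypothetical metric name
    mode="min",
    # Check the stopping condition at the end of the training epoch, when the
    # epoch-level metric actually exists, instead of when validation runs.
    check_on_train_epoch_end=True,
)

trainer = Trainer(max_epochs=10, callbacks=[early_stop])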


What does this PR do?

The current training epoch follows this hook pattern (with max_epochs=1, limit_train_batches=1, limit_val_batches=1):

'on_fit_start',
    # training
    'on_train_start',
        'on_epoch_start',
        'on_train_epoch_start',
            'on_train_batch_start',
            'on_train_batch_end',
        'on_train_epoch_end',
        'on_epoch_end',
        # end-of-epoch validation
        'on_validation_start',
            'on_epoch_start',
            'on_validation_epoch_start',
                'on_validation_batch_start',
                'on_validation_batch_end',
            'on_validation_epoch_end',
            'on_epoch_end',
        'on_validation_end',
    'on_train_end',
'on_fit_end',

Pros:

  • Clear distinction between on_train_epoch_end and on_train_end.

This means that when validation runs at the end of the epoch (the classic loop structure), it is treated as being outside the training epoch scope: it executes between on_train_epoch_end and on_train_end. This is inconsistent with the hook ordering used when validation runs in the middle of the epoch:

'on_fit_start',
    # training
    'on_train_start',
        'on_epoch_start',
        'on_train_epoch_start',
            'on_train_batch_start',
            'on_train_batch_end',
            # intra-training validation
            'on_validation_start',
                'on_epoch_start',
                'on_validation_epoch_start',
                    'on_validation_batch_start',
                    'on_validation_batch_end',
                'on_validation_epoch_end',
                'on_epoch_end',
            'on_validation_end',
        'on_train_epoch_end',
        'on_epoch_end',
    'on_train_end',
'on_fit_end',

This PR changes the code so the latter pattern is always used.
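
One way to observe the resulting hook order is a tiny callback (a sketch, not part of this PR; it assumes some LightningModule instance named model, and *args swallows extra hook arguments since hook signatures varied slightly across versions):

from pytorch_lightning import Callback, Trainer

class HookOrder(Callback):
    def on_validation_end(self, trainer, pl_module, *args):
        print("on_validation_end")

    def on_train_epoch_end(self, trainer, pl_module, *args):
        print("on_train_epoch_end")

    def on_train_end(self, trainer, pl_module, *args):
        print("on_train_end")

trainer = Trainer(
    max_epochs=1,
    limit_train_batches=1,
    limit_val_batches=1,
    num_sanity_val_steps=0,  # skip the sanity validation run before training
    callbacks=[HookOrder()],
)
# trainer.fit(model)
# Expected output with this PR: on_validation_end, on_train_epoch_end, on_train_end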

Pros:

  • Consistent hook ordering regardless of when validation runs

Cons:

  • Both training and validation dataloaders are in memory
  • Messes up existing timing code
  • Cannot checkpoint or early-stop on metrics logged on_train_epoch_end or inside training_step via self.log(..., on_epoch=True). This is a breaking change. EarlyStopping users can set check_on_train_epoch_end=True to avoid the error.
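
For reference, a sketch of the affected logging pattern (hypothetical module, not taken from this PR):

import torch
from pytorch_lightning import LightningModule

class LitModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        loss = self.layer(batch).sum()
        # With on_epoch=True, the epoch-level aggregate is only finalized when the
        # training epoch ends, i.e. after validation has already run, so callbacks
        # that check it during validation will not find it.
        self.log("train_loss", loss, on_step=False, on_epoch=True)
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)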

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

@pep8speaks commented May 4, 2021

Hello @carmocca! Thanks for updating this PR.

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-05-26 00:55:54 UTC

@codecov bot commented May 4, 2021

Codecov Report

Merging #7357 (1b18342) into master (d26953c) will decrease coverage by 0%.
The diff coverage is 100%.

@@          Coverage Diff           @@
##           master   #7357   +/-   ##
======================================
- Coverage      93%     92%   -0%     
======================================
  Files         199     199           
  Lines       12971   12960   -11     
======================================
- Hits        12001   11952   -49     
- Misses        970    1008   +38     

carmocca force-pushed the refactor/global-step-update branch 2 times, most recently from 633ae0c to b156d49, on May 4, 2021 22:39
carmocca force-pushed the refactor/global-step-update branch from b156d49 to df4d846 on May 4, 2021 23:32
carmocca changed the title from "[WIP] Ignore - running tests" to "[WIP] Always run validation inside the training loop epoch" on May 24, 2021
carmocca self-assigned this on May 24, 2021
carmocca added the design (Includes a design discussion) and refactor labels on May 24, 2021
carmocca added this to the v1.4 milestone on May 24, 2021
@ananthsub (Contributor) left a comment:

thanks for the cleanup @carmocca !

@justusschock (Member) left a comment:

I like it a lot!

Two questions about the cons you mentioned around the dataloader logic:

  • Having both dataloaders in memory shouldn't be a problem, since before this change you also had both datasets in memory (the memory footprint of a loader itself is very low).
  • Assuming you're talking about the worker processes: shouldn't the train loader's worker processes vanish (if set as non-persistent, of course) as soon as they have finished iterating over the dataset? From my understanding that would still be the case here.

@awaelchli (Contributor) left a comment:

great stuff

@carmocca (Contributor, Author) replied:

> Two questions about the cons you mentioned around the dataloader logic ...

I haven't checked whether this is actually happening, but what if your batches are huge and you keep the last training batch on the GPU while you run validation? This is what I was referring to.
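
A rough illustration of that concern (a sketch assuming a CUDA device is available; not code from this PR):

import torch

# If the loop still holds a reference to the last (large) training batch when
# validation starts, that GPU memory stays allocated for the whole validation run.
last_train_batch = torch.randn(64, 3, 512, 512, device="cuda")  # hypothetical "huge" batch
print(f"{torch.cuda.memory_allocated() / 2**20:.0f} MiB allocated while validation would run")

del last_train_batch            # dropping the reference is what frees the memory
torch.cuda.empty_cache()
print(f"{torch.cuda.memory_allocated() / 2**20:.0f} MiB allocated after releasing it")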

carmocca merged commit 311d9fe into master on May 26, 2021
carmocca deleted the refactor/global-step-update branch on May 26, 2021 12:26
carmocca mentioned this pull request on May 26, 2021
awaelchli added a commit that referenced this pull request on May 26, 2021
mergify bot added the ready (PRs ready to be merged) label on Aug 2, 2021
carmocca mentioned this pull request on Aug 2, 2021
Labels: design (Includes a design discussion), ready (PRs ready to be merged), refactor
Projects: none
Participants: 5