
Always run validation inside the training loop epoch #7357

Merged (40 commits) on May 26, 2021

Conversation

@carmocca (Contributor) commented May 4, 2021

Recap for readers from the future:

In previous versions, depending on your training flags (namely val_check_interval), validation would either (1) run after the training epoch finishes, or (2) run after some training batches but before the training epoch finishes. Pseudocode:

(1)

on_train_epoch_start()
for batch in train_dataloader():
    training_step()
    ...
on_train_epoch_end()

if should_run_validation:
    on_validation_epoch_start()
    for batch in val_dataloader():
        validation_step()
    on_validation_epoch_end()

(2)

on_train_epoch_start()
for batch in train_dataloader():
    training_step()
    ...
    if should_run_validation:
        on_validation_epoch_start()
        for batch in val_dataloader():
            validation_step()
        on_validation_epoch_end()
on_train_epoch_end()

This release changes the flow so that (2) is always used, regardless of the trainer configuration. This makes callbacks easier to write, since the same hook ordering now applies whether validation runs at the end of the training epoch or in the middle of it.
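
As a rough illustration (not from the PR itself), the flag that selects between the two flows is val_check_interval. A minimal sketch, assuming a LightningModule named model and a datamodule are defined elsewhere:

from pytorch_lightning import Trainer

# val_check_interval=1.0 (the default): validate once per epoch, after all
# training batches have run. Before this change, this produced flow (1).
trainer = Trainer(max_epochs=3, val_check_interval=1.0)

# val_check_interval=0.25: validate every 25% of the training batches,
# i.e. in the middle of the epoch. This has always followed flow (2).
trainer = Trainer(max_epochs=3, val_check_interval=0.25)

# After this change, both configurations run validation inside the training
# epoch, i.e. before on_train_epoch_end fires.
# trainer.fit(model, datamodule=datamodule)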

The breaking change: validation code and callbacks that relied on metrics logged or aggregated at the end of the training epoch will no longer work, because those metrics are not yet available when validation runs.

For example, if your early stopping callback was tracking a metric logged on training_epoch_end, you will need to set EarlyStopping(check_on_train_epoch_end=True).
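
A minimal sketch of that workaround, assuming a hypothetical metric name train_loss_epoch that the LightningModule logs during the training epoch (for example in training_epoch_end):

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import EarlyStopping

early_stop = EarlyStopping(
    monitor="train_loss_epoch",  # hypothetical metric name
    mode="min",
    # Check the stopping condition at the end of the training epoch, when the
    # epoch-level metric actually exists, instead of when validation runs.
    check_on_train_epoch_end=True,
)

trainer = Trainer(max_epochs=10, callbacks=[early_stop])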


What does this PR do?

The current training epoch follows this hook pattern (with max_epochs=1, limit_train_batches=1, limit_val_batches=1):

'on_fit_start',
    # training
    'on_train_start',
        'on_epoch_start',
        'on_train_epoch_start',
            'on_train_batch_start',
            'on_train_batch_end',
        'on_train_epoch_end',
        'on_epoch_end',
        # end-of-epoch validation
        'on_validation_start',
            'on_epoch_start',
            'on_validation_epoch_start',
                'on_validation_batch_start',
                'on_validation_batch_end',
            'on_validation_epoch_end',
            'on_epoch_end',
        'on_validation_end',
    'on_train_end',
'on_fit_end',

Pros:

  • Clear distinction between on_train_epoch_end and on_train_end.

This means that when validation runs at the end of the epoch (the classic loop structure), it is treated as being outside the training epoch scope: it executes between on_train_epoch_end and on_train_end. This is inconsistent with the hook ordering used when validation runs in the middle of the epoch:

'on_fit_start',
    # training
    'on_train_start',
        'on_epoch_start',
        'on_train_epoch_start',
            'on_train_batch_start',
            'on_train_batch_end',
            # intra-training validation
            'on_validation_start',
                'on_epoch_start',
                'on_validation_epoch_start',
                    'on_validation_batch_start',
                    'on_validation_batch_end',
                'on_validation_epoch_end',
                'on_epoch_end',
            'on_validation_end',
        'on_train_epoch_end',
        'on_epoch_end',
    'on_train_end',
'on_fit_end',

This PR changes the code so the latter pattern is always used.
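
One way to observe the resulting hook order is a tiny callback (a sketch, not part of this PR; it assumes some LightningModule instance named model, and *args swallows extra hook arguments since hook signatures varied slightly across versions):

from pytorch_lightning import Callback, Trainer

class HookOrder(Callback):
    def on_validation_end(self, trainer, pl_module, *args):
        print("on_validation_end")

    def on_train_epoch_end(self, trainer, pl_module, *args):
        print("on_train_epoch_end")

    def on_train_end(self, trainer, pl_module, *args):
        print("on_train_end")

trainer = Trainer(
    max_epochs=1,
    limit_train_batches=1,
    limit_val_batches=1,
    num_sanity_val_steps=0,  # skip the sanity validation run before training
    callbacks=[HookOrder()],
)
# trainer.fit(model)
# Expected output with this PR: on_validation_end, on_train_epoch_end, on_train_end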

Pros:

  • Consistent hook ordering regardless of when validation runs

Cons:

  • Both training and validation dataloaders are in memory
  • Messes up existing timing code
  • Cannot checkpoint or early-stop on metrics logged on_train_epoch_end or inside training_step via self.log(..., on_epoch=True). This is a breaking change. EarlyStopping users can set check_on_train_epoch_end=True to avoid the error.
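
For reference, a sketch of the affected logging pattern (hypothetical module, not taken from this PR):

import torch
from pytorch_lightning import LightningModule

class LitModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        loss = self.layer(batch).sum()
        # With on_epoch=True, the epoch-level aggregate is only finalized when the
        # training epoch ends, i.e. after validation has already run, so callbacks
        # that check it during validation will not find it.
        self.log("train_loss", loss, on_step=False, on_epoch=True)
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)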

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

@pep8speaks commented May 4, 2021

Hello @carmocca! Thanks for updating this PR.

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-05-26 00:55:54 UTC

@codecov bot commented May 4, 2021

Codecov Report

Merging #7357 (1b18342) into master (d26953c) will decrease coverage by 0%.
The diff coverage is 100%.

@@          Coverage Diff           @@
##           master   #7357   +/-   ##
======================================
- Coverage      93%     92%   -0%     
======================================
  Files         199     199           
  Lines       12971   12960   -11     
======================================
- Hits        12001   11952   -49     
- Misses        970    1008   +38     

carmocca force-pushed the refactor/global-step-update branch 2 times, most recently from 633ae0c to b156d49, on May 4, 2021 22:39
carmocca force-pushed the refactor/global-step-update branch from b156d49 to df4d846 on May 4, 2021 23:32
carmocca changed the title from "[WIP] Ignore - running tests" to "[WIP] Always run validation inside the training loop epoch" on May 24, 2021
carmocca self-assigned this on May 24, 2021
carmocca added the design (Includes a design discussion) and refactor labels on May 24, 2021
carmocca added this to the v1.4 milestone on May 24, 2021
@ananthsub (Contributor) left a comment:

thanks for the cleanup @carmocca !

@justusschock (Member) left a comment:

I like it a lot!

Two questions about the cons you mentioned around the dataloader logic:

  • Having both dataloaders in memory shouldn't be a problem, since before this change you also had both datasets in memory (the memory footprint of a loader itself is very low).
  • Assuming you're talking about the worker processes: shouldn't the train loader's worker processes vanish (if set as non-persistent, of course) as soon as they have finished iterating over the dataset? From my understanding that would still be the case here.

@awaelchli (Contributor) left a comment:

great stuff

@carmocca (Contributor, Author) replied:

> Two questions about the cons you mentioned around the dataloader logic ...

I haven't checked whether this is actually happening, but what if your batches are huge and you keep the last training batch on the GPU while you run validation? This is what I was referring to.
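
A rough illustration of that concern (a sketch assuming a CUDA device is available; not code from this PR):

import torch

# If the loop still holds a reference to the last (large) training batch when
# validation starts, that GPU memory stays allocated for the whole validation run.
last_train_batch = torch.randn(64, 3, 512, 512, device="cuda")  # hypothetical "huge" batch
print(f"{torch.cuda.memory_allocated() / 2**20:.0f} MiB allocated while validation would run")

del last_train_batch            # dropping the reference is what frees the memory
torch.cuda.empty_cache()
print(f"{torch.cuda.memory_allocated() / 2**20:.0f} MiB allocated after releasing it")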

carmocca merged commit 311d9fe into master on May 26, 2021
carmocca deleted the refactor/global-step-update branch on May 26, 2021 12:26
carmocca mentioned this pull request on May 26, 2021
awaelchli added a commit that referenced this pull request on May 26, 2021
mergify bot added the ready (PRs ready to be merged) label on Aug 2, 2021
carmocca mentioned this pull request on Aug 2, 2021
Labels: design (Includes a design discussion), ready (PRs ready to be merged), refactor
Projects: none
Participants: 5