Remember the eval mode of submodules when switching trainer stages #18951
Conversation
Codecov Report
Additional details and impacted files

@@            Coverage Diff             @@
##           master   #18951      +/-   ##
==========================================
- Coverage      76%      48%     -27%
==========================================
  Files         450      442       -8
  Lines       36508    36383     -125
==========================================
- Hits        27583    17572   -10011
- Misses       8925    18811    +9886
Does this PR replace #18826?
Co-authored-by: Carlos Mocholí <[email protected]>
The problem was in the training loop, which called `.train()` on the whole model. Also, currently in the docs it is shown that:

# ...
for batch_idx, batch in enumerate(train_dataloader):
    loss = model.training_step(batch, batch_idx)
    loss.backward()
    # ...

    if validate_at_some_point:
        # disable grads + batchnorm + dropout
        torch.set_grad_enabled(False)
        model.eval()

        # ----------------- VAL LOOP ---------------
        for val_batch_idx, val_batch in enumerate(val_dataloader):
            val_out = model.validation_step(val_batch, val_batch_idx)
        # ----------------- VAL LOOP ---------------

        # enable grads + batchnorm + dropout
        torch.set_grad_enabled(True)
        model.train()

I would be happy to open a new issue for fixing the docs to better align with this very useful PR!
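For reference, a sketch of how that pseudocode could remember and restore the per-submodule modes instead of calling `model.train()` on everything (illustrative only; the helper names are made up and this is not the Trainer's actual implementation):

# capture each submodule's .training flag before switching to validation
def capture_training_modes(model):
    return {name: module.training for name, module in model.named_modules()}

# restore the captured flags instead of blindly calling model.train()
def restore_training_modes(model, modes):
    for name, module in model.named_modules():
        module.training = modes[name]

# around the validation block of the loop above:
modes = capture_training_modes(model)
torch.set_grad_enabled(False)
model.eval()
# ... run the validation loop ...
torch.set_grad_enabled(True)
restore_training_modes(model, modes)  # layers the user froze stay in eval mode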
What does this PR do?
Fixes #18930
Part of #16827
A common issue users are facing is that the loop calls `train()` on the LightningModule despite the user having frozen certain layers. For example, a user may freeze a submodule and put it in `eval()` mode. This leads to a surprise when the user finds out that their batch norm layers have changed statistics, even though they were set explicitly to `eval()` mode. To avoid this, the user has to learn that they should override the `on_validation_model_eval()` and `on_validation_model_train()` hooks in the module, but this is a detail that is difficult to find in our docs and to get right. Most users who face this challenge end up on Slack or GitHub to ask for help.
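Until now, the workaround looked roughly like this (a sketch assuming a frozen submodule named `backbone`; the names are illustrative, not taken from the docs):

import lightning.pytorch as pl


class MyModel(pl.LightningModule):
    # ... __init__ defines a frozen self.backbone plus the trainable parts ...

    def on_validation_model_eval(self):
        self.eval()  # same as the default: put everything into eval mode

    def on_validation_model_train(self):
        self.train()          # the default would put *everything* back into training mode
        self.backbone.eval()  # so the frozen submodule must be switched back by hand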
The PR makes the following changes to automate this for the user:

- The Trainer now remembers the `.training` mode of every submodule before calling `.eval()`. When the validation loop ends, and before switching to training, it restores the `.training` mode on all submodules to what it was before. This ensures that layers the user has chosen to be in eval mode remain in eval mode!
- The Trainer no longer calls `.train()` at the beginning, with the same motivation: the user can now set a subset of their model to `.eval()` mode / freeze it explicitly in the LightningModule's `__init__` without doing acrobatics with hooks, and the Trainer will respect it and preserve it (see the added test; a sketch of this setup follows below). Note: This is not a breaking change, because PyTorch's default is to have a model in `.training=True` mode.
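For illustration, a minimal sketch of the kind of setup that now works out of the box (the module and layer names below are made up, not taken from the added test):

import torch
import lightning.pytorch as pl


class FinetuneModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        # hypothetical frozen feature extractor containing batch norm layers
        self.backbone = torch.nn.Sequential(
            torch.nn.Conv2d(3, 8, kernel_size=3),
            torch.nn.BatchNorm2d(8),
            torch.nn.ReLU(),
        )
        self.backbone.requires_grad_(False)
        self.backbone.eval()  # batch norm statistics should stay fixed
        self.head = torch.nn.Linear(8, 2)

    # training_step, validation_step, configure_optimizers, ... omitted


# With this change, the backbone keeps .training=False across the Trainer's
# switches between the training and validation loops, instead of being
# flipped back to training mode.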
📚 Documentation preview 📚: https://pytorch-lightning--18951.org.readthedocs.build/en/18951/
cc @Borda @justusschock @awaelchli