[Feat] Add Loops Restart #8131
Closed
This PR is closed in favour of #8337.
What does this PR do?
Fixes #8126.
TODOS:
[x] Add restart for training epoch loop
[x] Add restart for training batch loop
[x] Add restart for multiple optimizers
[x] Add support for random samplers
[x] Add support for accumulate_grad_batches (saving gradients)
[x] Add checkpoint creation on failure
[x] Add mechanism to prevent DDP deadlock on failure: Merged #8167
[x] Add restart for validation epoch loop
[x] Add restart for the validation batch loop
[x] Add restart for multiple validation dataloaders
[x] Add val_check_interval
[x] Add FastForwardSampler and CaptureIterativeDataset to recover dataset states (#8307)
[] Add support for results / extras on epoch_end for train
[] Add restore / restarting attribute function to Loops
[] Link progress tracking to Loop attributes
[] Investigate InfiniBatch for random seed (num_workers, ddp sampler)
[] Handle manual optimizer step (optimizer step / zero grad to take care of)
[] Test DistributedSampler
[] Add configuration mechanism to restart as currently supported. Start from next epoch.
[] Restart validation loop using should_check_val
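For reviewers unfamiliar with the sampler items above, here is a rough sketch of what "fast-forwarding" a random sampler after a failure could look like. It is not the FastForwardSampler from this PR: `ResumableSampler`, its `consumed` counter, and the `state_dict` / `load_state_dict` layout are hypothetical names chosen for illustration, and the sketch uses only the standard library rather than the real `torch.utils.data` classes.

```python
import random
from typing import Iterator, List


class ResumableSampler:
    """Hypothetical stand-in for illustration; not the PR's FastForwardSampler."""

    def __init__(self, num_samples: int, seed: int = 0) -> None:
        self.num_samples = num_samples
        self.seed = seed
        self.consumed = 0  # indices already yielded in the current epoch

    def _epoch_order(self) -> List[int]:
        # Re-seeding makes the shuffled order reproducible after a restart.
        # Simplification: every epoch reuses the same order.
        order = list(range(self.num_samples))
        random.Random(self.seed).shuffle(order)
        return order

    def __iter__(self) -> Iterator[int]:
        order = self._epoch_order()
        # Fast-forward: skip the indices consumed before the failure.
        for index in order[self.consumed:]:
            self.consumed += 1
            yield index
        self.consumed = 0  # epoch finished, the next one starts from scratch

    def state_dict(self) -> dict:
        return {"seed": self.seed, "consumed": self.consumed}

    def load_state_dict(self, state: dict) -> None:
        self.seed = state["seed"]
        self.consumed = state["consumed"]


# Uninterrupted epoch, used as the reference order.
reference = list(ResumableSampler(8, seed=42))

# Interrupted epoch: three indices are drawn, then a failure is simulated
# and the sampler state is captured as it would be in a checkpoint.
sampler = ResumableSampler(8, seed=42)
iterator = iter(sampler)
seen = [next(iterator) for _ in range(3)]
checkpoint = sampler.state_dict()

# Restart: a fresh sampler restores the state and yields only the rest.
restarted = ResumableSampler(8, seed=0)
restarted.load_state_dict(checkpoint)
remaining = list(restarted)
```

In this sketch, `seen + remaining` matches the uninterrupted `reference` order, which is the property a restart needs. With `num_workers > 0` or a distributed sampler the bookkeeping gets more involved, which is what the open items on InfiniBatch and DistributedSampler are about.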
Before submitting
PR review
Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the Review guidelines. In short, see the following bullet-list:
Did you have fun?
Make sure you had fun coding 🙃