Make Lightning checkpoint demo work with Bernard's GKE framework and with FSDP strategy #86
This updates the existing Lightning checkpoint demo so that you can run it both locally and on GKE using Bernard's framework. I also made the strategy configurable (DDP or FSDP) and made the demo runnable on either GPU (required for FSDP) or CPU.
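As a rough illustration of the strategy and accelerator switch described above, here is a minimal sketch. The flag names (`--strategy`, `--accelerator`) and the helper `trainer_kwargs` are hypothetical and may not match what `train.py` actually uses. The sketch only assumes that Lightning's `Trainer` accepts `strategy` and `accelerator` keyword arguments with the string values `"ddp"`, `"fsdp"`, `"cpu"`, and `"gpu"`.

```python
import argparse


def parse_args(argv=None):
    # Hypothetical flag names; the real train.py in this PR may differ.
    parser = argparse.ArgumentParser(description="Lightning checkpoint demo (sketch)")
    parser.add_argument("--strategy", choices=["ddp", "fsdp"], default="ddp")
    parser.add_argument("--accelerator", choices=["cpu", "gpu"], default="cpu")
    return parser.parse_args(argv)


def trainer_kwargs(strategy, accelerator):
    # FSDP is only supported on GPU in this demo, so reject the CPU combination early.
    if strategy == "fsdp" and accelerator != "gpu":
        raise ValueError("FSDP requires --accelerator=gpu")
    return {"strategy": strategy, "accelerator": accelerator}


if __name__ == "__main__":
    args = parse_args()
    kwargs = trainer_kwargs(args.strategy, args.accelerator)
    # With Lightning installed, these would be passed on as Trainer(**kwargs).
    print(kwargs)
```

Validating the combination before constructing the trainer means a misconfigured run fails immediately with a clear message instead of partway through startup.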
I also wrote up a README.md file showing how to run it and documenting the current limitations.
Although the diff here shows `train.py` as an entirely new file, it's really just moving `demo/lightning/lightning_checkpoint.py` into a new folder and adding some code to support the capabilities above.

One next step I want to do is to pull out the common framework code from all the demos, since right now it's duplicated between them, but I'll follow up on that in a separate PR.