Make Lightning checkpoint demo work with Bernard's GKE framework and with FSDP strategy #86

Merged
MattIrv merged 9 commits into main from mirvine/fsdp
Aug 8, 2024

Conversation

MattIrv
Collaborator

@MattIrv commented Aug 8, 2024

This updates the existing Lightning checkpoint demo so that you can run it both locally and on GKE using Bernard's framework. I also made the strategy configurable (DDP or FSDP) and made it possible to run on either GPU (required for FSDP) or CPU.
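
For readers without the diff open, the strategy and accelerator selection described above could be sketched roughly as follows. This is not the code from train.py: the `--strategy` and `--accelerator` flag names and the `resolve_trainer_kwargs` helper are hypothetical, chosen only for illustration. The one rule taken from this PR is that FSDP requires GPU.

```python
import argparse

# Hypothetical sketch; names below are illustrative and not taken from train.py.
VALID_STRATEGIES = ("ddp", "fsdp")
VALID_ACCELERATORS = ("cpu", "gpu")


def resolve_trainer_kwargs(strategy: str, accelerator: str) -> dict:
    """Validate the strategy/accelerator pair and return Trainer-style kwargs."""
    strategy = strategy.lower()
    accelerator = accelerator.lower()
    if strategy not in VALID_STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy!r}")
    if accelerator not in VALID_ACCELERATORS:
        raise ValueError(f"unknown accelerator: {accelerator!r}")
    # Per the PR description, FSDP only runs on GPU; DDP works on either.
    if strategy == "fsdp" and accelerator != "gpu":
        raise ValueError("FSDP requires accelerator='gpu'")
    return {"strategy": strategy, "accelerator": accelerator}


def parse_args(argv=None) -> argparse.Namespace:
    """Parse the (assumed) command-line flags for the demo."""
    parser = argparse.ArgumentParser(description="Lightning checkpoint demo (sketch)")
    parser.add_argument("--strategy", choices=VALID_STRATEGIES, default="ddp")
    parser.add_argument("--accelerator", choices=VALID_ACCELERATORS, default="cpu")
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args()
    # With Lightning installed, these kwargs could be forwarded to its Trainer.
    print(resolve_trainer_kwargs(args.strategy, args.accelerator))
```

Rejecting the FSDP-on-CPU combination up front gives a clear error before any training setup starts. The actual flags and defaults are documented in the README.md added by this PR.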

I also wrote up a README.md file showing how to run it and documenting the current limitations.

Although the diff shows train.py as an entirely new file, it's really just demo/lightning/lightning_checkpoint.py moved into a new folder, plus some code to support the capabilities above.

As a next step, I want to pull the common framework code out of the demos, since it's currently duplicated across them. I'll follow up on that in a separate PR.

@MattIrv requested a review from a team as a code owner August 8, 2024 15:12
@MattIrv enabled auto-merge (squash) August 8, 2024 16:57
@MattIrv merged commit 1a5ebd8 into main Aug 8, 2024
1 of 2 checks passed
@MattIrv deleted the mirvine/fsdp branch August 8, 2024 17:03