Implement a custom FSDP strategy for benchmarking loads from boot disk #157

Merged: 19 commits, Oct 19, 2024

@abhibyreddi (Collaborator) commented Oct 17, 2024

This change is needed to benchmark checkpoint saves to, and loads from, boot disk in a distributed environment.

  • When run with --save_only, Lightning's FSDP strategy is used and load calls are skipped.
  • When run with --load_only, the newly added LoadFromBootDiskFSDP strategy is used. It saves with Dataflux; all nodes then copy the full contents of the GCS bucket to boot disk and load their checkpoints from there. The average time to save one checkpoint is reported as skipped.
  • All custom strategy classes are moved to demo/lightning/checkpoint/multinode/strategies.py.
  • Tests pass: ran manually on a VM with GPUs and on a GKE cluster with 2 nodes.
  • Appropriate changes to documentation are included in the PR: a follow-up PR will update the README.
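
For readers unfamiliar with the --load_only flow, here is a rough sketch of the stage-then-load step, not the code in this PR. It uses a local directory as a stand-in for the GCS bucket (an assumption, so it runs without cloud credentials). The helper names `stage_to_boot_disk`, `timed_load`, and `run_demo`, and the file name `shard_0.distcp`, are hypothetical and chosen only for illustration.

```python
from __future__ import annotations

import shutil
import tempfile
import time
from pathlib import Path


def stage_to_boot_disk(bucket_dir: Path, boot_disk_dir: Path) -> int:
    """Copy everything under bucket_dir to boot_disk_dir; return the file count.

    bucket_dir is a local stand-in for the GCS bucket (assumption). A real run
    would download the objects with a GCS client instead of copying a folder.
    """
    shutil.copytree(bucket_dir, boot_disk_dir, dirs_exist_ok=True)
    return sum(1 for p in boot_disk_dir.rglob("*") if p.is_file())


def timed_load(checkpoint_path: Path) -> tuple[bytes, float]:
    """Read one checkpoint file; return its bytes and the elapsed seconds."""
    start = time.perf_counter()
    data = checkpoint_path.read_bytes()
    return data, time.perf_counter() - start


def run_demo() -> tuple[int, bytes, float]:
    """Create a fake bucket, stage it to a fake boot disk, then time one load."""
    with tempfile.TemporaryDirectory() as tmp:
        bucket = Path(tmp) / "bucket"
        boot_disk = Path(tmp) / "boot_disk"
        (bucket / "ckpt_0").mkdir(parents=True)
        (bucket / "ckpt_0" / "meta.pt").write_bytes(b"meta")
        (bucket / "ckpt_0" / "shard_0.distcp").write_bytes(b"weights")

        # Every node copies the whole bucket, not only its own shard.
        copied = stage_to_boot_disk(bucket, boot_disk)
        data, seconds = timed_load(boot_disk / "ckpt_0" / "shard_0.distcp")
        return copied, data, seconds


if __name__ == "__main__":
    copied, data, seconds = run_demo()
    print(f"staged {copied} files, loaded {len(data)} bytes in {seconds:.6f}s")
```

Only the load is timed in this sketch, which mirrors the description above: the save time is not measured in --load_only mode.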

@abhibyreddi changed the title from "Customize FSDP strategy where all nodes write meta.pt to their checkpoint dirs" to "Implement a custom FSDP strategy for benchmarking loads from boot disk" on Oct 18, 2024
@abhibyreddi marked this pull request as ready for review on October 18, 2024 22:43
@abhibyreddi requested a review from a team as a code owner on October 18, 2024 22:43
@abhibyreddi requested review from MattIrv, bernardhan33, Yash9060 and jdnurme, and removed the request for bernardhan33, on October 18, 2024 22:43
@abhibyreddi (Collaborator, Author) commented

The contents of demo/lightning/checkpoint/multinode/strategies.py have mostly been reviewed already. The exception is the CustomFSDPStrategy class, which is introduced in this PR.

@abhibyreddi enabled auto-merge (squash) on October 19, 2024 00:44
@abhibyreddi merged commit d5651c8 into main on Oct 19, 2024
5 checks passed
@abhibyreddi deleted the abhibyreddi/custom-fsdp branch on October 19, 2024 02:04
3 participants