Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lightning checkpoint #40

Merged
merged 17 commits into from
Jun 26, 2024
Merged

Lightning checkpoint #40

merged 17 commits into from
Jun 26, 2024

Conversation

divrawal
Copy link
Collaborator

@divrawal divrawal commented Jun 24, 2024

Lightning Checkpoint support in dataflux for PyTorch.

Includes

  • CheckpointIO interface implementation

  • Unit tests

  • Example file

  • Notebook

  • README change.

  • Tests pass

  • Appropriate changes to documentation are included in the PR

@divrawal divrawal requested a review from a team as a code owner June 24, 2024 21:31
@divrawal divrawal requested review from MattIrv and bernardhan33 June 24, 2024 21:31
Copy link
Collaborator

@bernardhan33 bernardhan33 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Congrats on first contribution to dataflux-pytorch!

dataflux_pytorch/examples/lightning_checkpoint.py Outdated Show resolved Hide resolved
pyproject.toml Outdated Show resolved Hide resolved
dataflux_pytorch/examples/lightning_checkpoint.py Outdated Show resolved Hide resolved
dataflux_pytorch/examples/lightning_checkpoint.py Outdated Show resolved Hide resolved
@bernardhan33
Copy link
Collaborator

BTW the Kokoro test is failing, we will need to fix the test failure there.

dataflux_client_python Outdated Show resolved Hide resolved
)
self.bucket = self.storage_client.bucket(self.bucket_name)

def _parse_gcs_path(self, path: str) -> str:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the client is initialized with a project name and bucket name, why is this necessary? Could we just have the user provide a path which is relative to the bucket rather than starting with "gcs:///"?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was the original implementation but after I wrote the demo/example file I saw that for automated checkpointing this needed to ingest a full path. The full path is what gets passed through to the trainer through the ModelCheckpoint object.

Copy link
Collaborator

@MattIrv MattIrv Jun 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can't we update the demo to not pass in the gcs://<bucket>/ part of the path? Alternatively, could we remove the bucket from the constructor in this file? It seems duplicative as-is

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it sounds like we are taking on this feature essentially, which I think it's great to have to make the user experience similar with gcsfs, etc. I'd also agree on just taking away buckets from the constructor. We'd also need work to support this for the two datasets though but definitely for different PRs.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see any code changes here, am I missing them? @divrawal can you follow up on it?

divrawal added 2 commits June 26, 2024 22:25
…y one object concerning, and for save_checkpoint modified so it utilizes a context manager
@divrawal divrawal merged commit 6376bc2 into main Jun 26, 2024
2 checks passed
@divrawal divrawal deleted the lightning-checkpoint branch June 26, 2024 23:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants