Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace/deprecate SLURM specific checkpointing #5373

Closed
tarepan opened this issue Jan 6, 2021 · 3 comments
Closed

Replace/deprecate SLURM specific checkpointing #5373

tarepan opened this issue Jan 6, 2021 · 3 comments
Labels
feature Is an improvement or enhancement help wanted Open to be worked on refactor

Comments

@tarepan
Copy link
Contributor

tarepan commented Jan 6, 2021

🚀 Feature

Replace HPC/SLURM-specific checkpointing with general checkpointing in CheckpointConnector, then deprecate it.

Motivation

CheckpointConnector has HPC/SLURM-specific checkpointing (save/load system) for auto-resubmit (doc).
Now that auto-resubmit is supported in normal checkpointing process (#4402), HPC auto-resubmit also can be handled by this general process.
In my opinion, the HPC-specific checkpointing ended its historical role.
By deprecating this specific checkpointing, CheckpointConnector can be refactored so simple and become easy to maintain.

Pitch

Deprecate hpc_save & hpc_load, which use hpc_ckpt_{ckpt_number}.ckpt name convention for auto-resume/resubmit.
Use general checkpointing, which attempt to use last.ckpt automatically, for SLURM auto-resubmit.

Backward compatibility

The deprecation break previously-generated checkpoint for auto-resubmit.
But auto-resubmit checkpoint is, in general, used within short-term.
In other words, the checkpoint is ephemeral.

And hpc_save and hpc_load are internal method (no public API in docs).

In this point of views, in my opinion, we can deprecate (internal) old checkpointing without deprecation warning/term.

@tarepan tarepan added feature Is an improvement or enhancement help wanted Open to be worked on labels Jan 6, 2021
@tarepan tarepan changed the title Deprecate HPC/SLURM specific checkpointing Replace/deprecate SLURM specific checkpointing Jan 6, 2021
@awaelchli awaelchli added this to the 1.2 milestone Jan 6, 2021
@tarepan
Copy link
Contributor Author

tarepan commented Jan 18, 2021

Are there any opinion, contributors?
If core contributors agree, I will make PR for this issue.

@edenlightning edenlightning modified the milestones: 1.2, 1.3 Feb 8, 2021
@edenlightning edenlightning removed this from the v1.3 milestone Apr 27, 2021
@edenlightning
Copy link
Contributor

@PyTorchLightning/core-contributors ?

@carmocca
Copy link
Contributor

The idea looks good to me overall.

This issue is pretty much a duplicate of #6204

Closing this one even though it was created first since it contains less activity

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature Is an improvement or enhancement help wanted Open to be worked on refactor
Projects
None yet
Development

No branches or pull requests

4 participants