Replace the HPC/SLURM-specific checkpointing in CheckpointConnector with general checkpointing, then deprecate the HPC-specific path.
Motivation
CheckpointConnector has an HPC/SLURM-specific checkpointing path (a save/load system) for auto-resubmit (doc).
Now that auto-resubmit is supported in the normal checkpointing process (#4402), HPC auto-resubmit can also be handled by this general process.
In my opinion, HPC-specific checkpointing has outlived its historical role.
Deprecating it would let CheckpointConnector be refactored into something simpler and easier to maintain.
Pitch
Deprecate hpc_save and hpc_load, which use the hpc_ckpt_{ckpt_number}.ckpt naming convention for auto-resume/resubmit.
Use general checkpointing, which attempts to use last.ckpt automatically, for SLURM auto-resubmit.
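To make the difference between the two lookup rules concrete, here is a minimal sketch, not the actual CheckpointConnector code. The function names `find_legacy_hpc_checkpoint` and `find_resume_checkpoint` and the flat checkpoint directory layout are assumptions made for illustration; only the file names hpc_ckpt_{ckpt_number}.ckpt and last.ckpt come from the proposal above.

```python
import re
from pathlib import Path
from typing import Optional

# Naming convention used by the HPC-specific path: hpc_ckpt_{ckpt_number}.ckpt
HPC_CKPT_PATTERN = re.compile(r"^hpc_ckpt_(\d+)\.ckpt$")


def find_legacy_hpc_checkpoint(ckpt_dir: Path) -> Optional[Path]:
    """Hypothetical helper: return hpc_ckpt_{n}.ckpt with the largest n, or None."""
    if not ckpt_dir.is_dir():
        return None
    best_path: Optional[Path] = None
    best_number = -1
    for path in ckpt_dir.iterdir():
        match = HPC_CKPT_PATTERN.match(path.name)
        # Compare numerically so that hpc_ckpt_10 beats hpc_ckpt_2.
        if match and int(match.group(1)) > best_number:
            best_number = int(match.group(1))
            best_path = path
    return best_path


def find_resume_checkpoint(ckpt_dir: Path) -> Optional[Path]:
    """Hypothetical helper: return last.ckpt if it exists, otherwise None."""
    last = ckpt_dir / "last.ckpt"
    return last if last.is_file() else None
```

Under this sketch, the general rule needs no number parsing or directory scan: a resubmitted job only has to check for a single well-known file, which is the simplification the proposal is after.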
Backward compatibility
The deprecation breaks previously generated auto-resubmit checkpoints.
But an auto-resubmit checkpoint is, in general, used only over a short term.
In other words, the checkpoint is ephemeral.
Also, hpc_save and hpc_load are internal methods (not part of the public API in the docs).
From this point of view, in my opinion, we can deprecate the (internal) old checkpointing without a deprecation warning or deprecation period.