-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BUG] The new symlink feature of ModelCheckpoint is broken when wrong link path #19294
Comments
@shenmishajing Thanks for the investigation. I am not sure that I understand the problem fully from your description. We could make the Regarding hardlinks, I vote against it. We need to support checkpoints as directories (i.e. FSDP/DeepSpeed sharded checkpoints) and hardlinks won't work for directories. Also, we would like that users can share their checkpoints from one filesystem to another, and hardlinks only work on the filesystem they were created on. For these reasons, I suggest to stay with symlinks. |
FYI, we backed away from linking as the default in #19191 recently. |
@awaelchli Yes, I think it will work. In the most case, it will work fine with just os.symlink(os.path.basename(filepath), linkpath) since, we have just created the checkpoint after the call of this func. But, it will be more complex to support the filepath and linkpath from everywhere. I will try to implement this.
Yeah, ok, I forget the sharded checkpoints, so the soft link will be better. |
That would be wonderful. Your help is very appreciated! |
@awaelchli Ok, I find that the
to os.symlink(os.path.relpath(filepath, os.path.dirname(linkpath)), linkpath) will fix this issue. Should I create a PR for it? And how about the idea of the |
Yes that sounds good. It would also be good to have a test case to show how it fails.
I suggest to open a separate feature issue about this request. I'd like to hear from others about this too. Currently, you can already achieve getting a best.ckpt by having a second model checkpoint callback. |
@awaelchli I have created a PR #19303 for this. However, I found that the |
@awaelchli I also create a new issue for it. Please see #19304. |
Yeah I can help with that, I will take a look in the next few hours |
Bug description
As demonstrated in the above figure, the
last.ckpt
is linked to a wrong path. The correct path is justepoch:299-val_loss:0.ckpt
instead of the long path fromwork_dirs
.https://github.com/Lightning-AI/pytorch-lightning/blob/628ee0cb61637d01a44884e0d805e238ea6b49bb/src/lightning/pytorch/callbacks/model_checkpoint.py#L692C25-L692C25
If we want to use the soft link, the above code should modify the
filepath
to start from thedirname
ofself._last_checkpoint_saved
, instead of using the unchangedfilepath
, which is started from the current working dir.In addition, though I agree with the idea of using the file link of
last.ckpt
to reduce the occupation of checkpoints on disk, it will be better to use the hard link instead of the soft link if possible on the local file system since the hard link will be more robust.And the hard link use the relative path from the working dir, so the fix to this issue will be simply modifying the code in
pytorch-lightning/src/lightning/pytorch/callbacks/model_checkpoint.py
Line 406 in 628ee0c
to
What version are you seeing the problem on?
v2.1
How to reproduce the bug
No response
Error messages and logs
Environment
Current environment
More info
By the way, since we already implement the link of
last.ckpt
, how about adding a new feature onbest.ckpt
? Although we support saving the topk checkpoints, there is no consistent way to reference the top1 checkpoint. For example, in the above figure, I have to check the files under the folderwork_dirs/medclip_resnet50_iu_xray_R2GenGPT_30x/2024.01.16_17.44.40.927262/checkpoints
to determine the file nameepoch:5-val_loss:3.264.ckpt
to reference the best checkpoint if I have not created thebest.ckpt
checkpoint. It will be more convenient if we implement the hard link ofbest.ckpt
also. It can be implemented with the similar code as thelast.ckpt
part.cc @carmocca @awaelchli
The text was updated successfully, but these errors were encountered: