Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Option to save last checkpoint as copy instead of symlinking #18995

Closed
ad12 opened this issue Nov 13, 2023 · 8 comments
Closed

Option to save last checkpoint as copy instead of symlinking #18995

ad12 opened this issue Nov 13, 2023 · 8 comments
Labels
callback: model checkpoint feature Is an improvement or enhancement

Comments

@ad12
Copy link

ad12 commented Nov 13, 2023

Description & Motivation

Saving the last.ckpt as a symlink on local file systems makes a lot of sense for most workflows. However, in a several cases, users often back up their checkpoints to cloud storage (AWS, GCP, etc.). In these scenarios, it is difficult to manage symlinks because they are often an all-or-nothing upload -- i.e. we cannot choose which symlinks to upload without being highly prescriptive on upload.

Checkpoints, especially last.ckpt, are critical for resuming runs, fine-tuning, etc. So we often want to back these up. However, when last.ckpt is a symlink, the backup process to cloud becomes much more involved.

Pitch

Add option `save_last=copy', where we save a copy of the last checkpoint

Alternatives

No response

Additional context

No response

cc @Borda @carmocca @awaelchli

@ad12 ad12 added feature Is an improvement or enhancement needs triage Waiting to be triaged by maintainers labels Nov 13, 2023
@carmocca carmocca added callback: model checkpoint and removed needs triage Waiting to be triaged by maintainers labels Nov 21, 2023
@carmocca
Copy link
Contributor

I didn't fully understand the motivation. What if you added a step at the end of your script that moves/copies the symlink target to the symlink location? This should give you the behaviour that you want without having to copy every in-between last checkpoint along the way.
Do you see any problems with this alternative?

@awaelchli
Copy link
Contributor

I'm also wondering what the challenge is here. A symbolic link is a file too, so it can be backed up. And it (normally) is a relative path, so if you download the checkpoint folder from your backup to a different location, the link will continue to just work.

@ad12
Copy link
Author

ad12 commented Nov 23, 2023

there are several symlink files that can be output to an experiment folder (e.g. W&B run symlinks). Managing symlinks independently of one another is an additional overhead for certain libraries (e.g. aws cli, rclone, etc.)

Having an option save_last=copy would allow defaulting to original behavior of checkpointing last.ckpt with minimal implementation overhead.

@ad12
Copy link
Author

ad12 commented Nov 23, 2023

What` if you added a step at the end of your script that moves/copies the symlink target to the symlink location? This should give you the behaviour that you want without having to copy every in-between last checkpoint along the way.

it would be helpful to be able to do this during training. A workaround is to either

  1. have a different process that watches the checkpoints folder and runs a sync when things change
  2. have a callback that does the syncing every time a checkpoint is made

Both of these seem more involved than adding a save_last='copy' option, which may be useful to other users who may want to default to the old checkpointing functionality

@dljjqy

This comment was marked as abuse.

@bgswaroop
Copy link
Contributor

bgswaroop commented Jan 20, 2024

I'm also wondering what the challenge is here. A symbolic link is a file too, so it can be backed up. And it (normally) is a relative path, so if you download the checkpoint folder from your backup to a different location, the link will continue to just work.

Not all commands are designed to handle symlinks. For instance, I just faced an issue with pathlib:
from pathlib import Path
assert Path("last.ckpt").exists(), f'{last.ckpt} does not exists!'
This raises an assertion!

Therefore, in a workflow, when we want to check if a checkpoint path exists before loading from it, this will not work!

@bgswaroop
Copy link
Contributor

bgswaroop commented Jan 20, 2024

To add to my previous comment, it is possible to get the actual path using the resolve() method however...

To get the actual path from the symlink, we can use
resolved_path = Path("last.ckpt").resolve()

Let's say, that the user has renamed one of the parent directories in the path of last.ckpt. This means, resolved_path will always point to the wrong target. This effectively makes last.ckpt unaccessible.

I think it is a great option to consider this feature request. If required, I am willing to contribute to this issue. Although, I need to understard the entire workflow!

Looking forward to your sugestions!

@awaelchli
Copy link
Contributor

@bgswaroop This feature request is completed. And #19303 made the link consistently relative.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
callback: model checkpoint feature Is an improvement or enhancement
Projects
None yet
Development

No branches or pull requests

5 participants