last.ckpt is a soft link of best.ckpt #19189

Closed
Jinbo-Hu opened this issue Dec 20, 2023 · 5 comments · Fixed by #19191

Comments

Jinbo-Hu commented Dec 20, 2023

Bug description

When I save last.ckpt, it is created as a symlink to best.ckpt, even though I have already set model_checkpoint.save_last=True.

What version are you seeing the problem on?

v2.1, master

How to reproduce the bug

No response

Error messages and logs

# Error messages and logs here please

Environment

Current environment
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 2.0):
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):

More info

No response

cc @carmocca @awaelchli

Jinbo-Hu added the "bug" and "needs triage" labels on Dec 20, 2023
awaelchli added the "callback: model checkpoint" and "question" labels and removed the "needs triage" and "bug" labels on Dec 20, 2023
Contributor

awaelchli commented Dec 20, 2023

This was a new feature in 2.1, not a bug. You can find some discussion here: #18995
Sorry if it was unexpected.

@Jinbo-Hu
Author

But how do I automatically save the latest (newest) epoch? I just upgraded from 2.0.4 to 2.1.3.

Eleven1Liu commented Jan 6, 2024

In my case, when I set model_checkpoint.save_last=True, last.ckpt is a link to best_model.ckpt.
However, best_model.ckpt should be different from last.ckpt. In my understanding, best_model.ckpt is the best model according to the validation metric, while last.ckpt is the model from the last epoch. They are not the same.
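To check what a given last.ckpt actually is on disk, a small standard-library helper like the one below can help. It is only a sketch: the function name `describe_checkpoint` is made up for illustration, and it inspects the filesystem rather than calling any Lightning API.

```python
import os
from pathlib import Path


def describe_checkpoint(path):
    """Return (is_symlink, target) for a checkpoint path.

    For a symlink, target is the path the link points at (made relative
    to the link's own directory when the link target is relative).
    For a regular file, target is the path itself.
    """
    p = Path(path)
    if p.is_symlink():
        target = Path(os.readlink(p))
        if not target.is_absolute():
            target = p.parent / target
        return True, target
    return False, p
```

Calling `describe_checkpoint("checkpoints/last.ckpt")` (the directory name here is just an example) reports whether the file is a link and, if so, which checkpoint it points at.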

The second error we got was when we use

trainer.fit(model, train_loader)

without val_loader (training for a fixed number of epochs without validation)

We got OSError: [Errno 40] Too many levels of symbolic links: '...' because last.ckpt is linked to itself.
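The same class of error can be reproduced without Lightning, which may help confirm that a self-referencing link is the cause. The sketch below assumes a POSIX system where creating symlinks is allowed; `open_self_link` is a hypothetical helper name, not part of any library.

```python
import errno
import os
import tempfile


def open_self_link():
    """Create last.ckpt -> last.ckpt in a temp dir and return the errno raised on open."""
    with tempfile.TemporaryDirectory() as tmp:
        link = os.path.join(tmp, "last.ckpt")
        # A relative target is resolved against the link's directory,
        # so this link points at itself.
        os.symlink("last.ckpt", link)
        try:
            with open(link, "rb"):
                pass
        except OSError as exc:
            return exc.errno
    return None


if __name__ == "__main__":
    code = open_self_link()
    print(code == errno.ELOOP, os.strerror(code))
```

The returned value is `errno.ELOOP`, which is the error number behind the "Too many levels of symbolic links" message.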
I think this PR is helpful for us.

Or should we wait for a release that includes the following option:

ModelCheckpoint(save_last='copy')

We upgraded Lightning to 2.1.3 via pip install but still get the symbolic-link behavior.
Thanks!
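Until a release changes this behavior, one possible stopgap is to turn the link into a standalone file after training. This is only a sketch using the standard library: `materialize_checkpoint` is a hypothetical helper, and it does not change which training state the file contains. It only copies whatever the link currently points at, so it does not help if last.ckpt links to the best checkpoint rather than the final epoch.

```python
import os
import shutil


def materialize_checkpoint(link_path):
    """Replace a symlinked checkpoint with a regular-file copy of its target.

    Returns True if a link was replaced, False if link_path was not a symlink.
    """
    if not os.path.islink(link_path):
        return False
    target = os.path.realpath(link_path)
    if not os.path.isfile(target):
        # Covers dangling links and links that loop back to themselves.
        raise FileNotFoundError(f"link target is not a regular file: {target}")
    tmp_copy = link_path + ".tmp"
    shutil.copyfile(target, tmp_copy)
    # os.replace swaps out the link itself, not the file it points at.
    os.replace(tmp_copy, link_path)
    return True
```

For example, `materialize_checkpoint("checkpoints/last.ckpt")` could be called once `trainer.fit(...)` has returned; the path is an assumption about where checkpoints are written.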

bfs18 commented Feb 9, 2024

It is not fixed in 2.1.4.
