Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] The new symlink feature of ModelCheckpoint is broken when wrong link path #19294

Closed
shenmishajing opened this issue Jan 16, 2024 · 9 comments · Fixed by #19303
Closed

Comments

@shenmishajing
Copy link
Contributor

shenmishajing commented Jan 16, 2024

Bug description

Screenshot 2024-01-16 at 18 30 55

As demonstrated in the above figure, the last.ckpt is linked to a wrong path. The correct path is just epoch:299-val_loss:0.ckpt instead of the long path from work_dirs.

https://github.com/Lightning-AI/pytorch-lightning/blob/628ee0cb61637d01a44884e0d805e238ea6b49bb/src/lightning/pytorch/callbacks/model_checkpoint.py#L692C25-L692C25

If we want to use the soft link, the above code should modify the filepath to start from the dirname of self._last_checkpoint_saved, instead of using the unchanged filepath, which is started from the current working dir.

In addition, though I agree with the idea of using the file link of last.ckpt to reduce the occupation of checkpoints on disk, it will be better to use the hard link instead of the soft link if possible on the local file system since the hard link will be more robust.

And the hard link use the relative path from the working dir, so the fix to this issue will be simply modifying the code in

to

                os.link(filepath, linkpath)

What version are you seeing the problem on?

v2.1

How to reproduce the bug

No response

Error messages and logs

# Error messages and logs here please

Environment

Current environment
  • CUDA:
    • GPU:
      • NVIDIA GeForce RTX 4090
      • NVIDIA GeForce RTX 4090
      • NVIDIA GeForce RTX 4090
      • NVIDIA GeForce RTX 4090
      • NVIDIA GeForce RTX 4090
      • NVIDIA GeForce RTX 4090
      • NVIDIA GeForce RTX 4090
      • NVIDIA GeForce RTX 4090
    • available: True
    • version: 12.1
  • Lightning:
    • lightning: 2.1.3
    • lightning-template: 1.5.1
    • lightning-utilities: 0.10.0
    • pytorch-lightning: 2.1.3
    • torch: 2.1.2
    • torchmetrics: 1.2.1
    • torchvision: 0.16.2
  • Packages:
    • accessible-pygments: 0.0.4
    • aiohttp: 3.9.1
    • aiosignal: 1.3.1
    • alabaster: 0.7.16
    • annotated-types: 0.6.0
    • antlr4-python3-runtime: 4.9.3
    • anyascii: 0.3.2
    • anyio: 4.2.0
    • appdirs: 1.4.4
    • argcomplete: 3.1.6
    • astroid: 3.0.2
    • attrs: 23.2.0
    • babel: 2.14.0
    • beautifulsoup4: 4.12.2
    • build: 1.0.3
    • cachetools: 5.3.2
    • certifi: 2023.11.17
    • cfgv: 3.4.0
    • chardet: 5.2.0
    • charset-normalizer: 3.3.2
    • click: 8.1.7
    • colorama: 0.4.6
    • commitizen: 3.13.0
    • contourpy: 1.2.0
    • coverage: 7.4.0
    • cycler: 0.12.1
    • decli: 0.6.1
    • distlib: 0.3.8
    • distro: 1.9.0
    • docker-pycreds: 0.4.0
    • docstring-parser: 0.15
    • docutils: 0.20.1
    • filelock: 3.13.1
    • fonttools: 4.47.0
    • frozenlist: 1.4.1
    • fsspec: 2023.12.2
    • gensim: 4.3.2
    • gitdb: 4.0.11
    • gitpython: 3.1.40
    • googletrans: 2.4.0
    • h11: 0.14.0
    • httpcore: 1.0.2
    • httpx: 0.26.0
    • huggingface-hub: 0.20.2
    • identify: 2.5.33
    • idna: 3.6
    • imagesize: 1.4.1
    • importlib-metadata: 6.11.0
    • importlib-resources: 6.1.1
    • iniconfig: 2.0.0
    • jinja2: 3.1.2
    • joblib: 1.3.2
    • jsonargparse: 4.27.1
    • jsonnet: 0.20.0
    • jsonschema: 4.20.0
    • jsonschema-specifications: 2023.12.1
    • kiwisolver: 1.4.5
    • lightning: 2.1.3
    • lightning-template: 1.5.1
    • lightning-utilities: 0.10.0
    • livereload: 2.6.3
    • logmatic-python: 0.1.7
    • markdown-it-py: 3.0.0
    • markupsafe: 2.1.3
    • matplotlib: 3.8.2
    • mdit-py-plugins: 0.4.0
    • mdurl: 0.1.2
    • medclip: 0.0.3
    • mpmath: 1.3.0
    • multidict: 6.0.4
    • myst-parser: 2.0.0
    • networkx: 3.2.1
    • nltk: 3.8.1
    • nodeenv: 1.8.0
    • numpy: 1.26.3
    • nvidia-cublas-cu12: 12.1.3.1
    • nvidia-cuda-cupti-cu12: 12.1.105
    • nvidia-cuda-nvrtc-cu12: 12.1.105
    • nvidia-cuda-runtime-cu12: 12.1.105
    • nvidia-cudnn-cu12: 8.9.2.26
    • nvidia-cufft-cu12: 11.0.2.54
    • nvidia-curand-cu12: 10.3.2.106
    • nvidia-cusolver-cu12: 11.4.5.107
    • nvidia-cusparse-cu12: 12.1.0.106
    • nvidia-nccl-cu12: 2.18.1
    • nvidia-nvjitlink-cu12: 12.3.101
    • nvidia-nvtx-cu12: 12.1.105
    • omegaconf: 2.3.0
    • openai: 0.28.0
    • packaging: 23.2
    • pandas: 2.1.4
    • pillow: 10.2.0
    • pip: 23.3.1
    • platformdirs: 4.1.0
    • pluggy: 1.3.0
    • pre-commit: 3.6.0
    • project-template: 0.2.2.dev4+dirty
    • prompt-toolkit: 3.0.36
    • protobuf: 4.25.1
    • psutil: 5.9.7
    • pydantic: 2.5.3
    • pydantic-core: 2.14.6
    • pydata-sphinx-theme: 0.15.1
    • pygments: 2.17.2
    • pyparsing: 3.1.1
    • pyproject-api: 1.6.1
    • pyproject-hooks: 1.0.0
    • pytest: 7.4.4
    • pytest-cov: 4.1.0
    • python-dateutil: 2.8.2
    • python-json-logger: 2.0.7
    • pytorch-lightning: 2.1.3
    • pytz: 2023.3.post1
    • pyyaml: 6.0.1
    • questionary: 2.0.1
    • reconplogger: 4.12.0
    • referencing: 0.32.1
    • regex: 2023.12.25
    • requests: 2.31.0
    • rich: 13.7.0
    • rpds-py: 0.16.2
    • ruyaml: 0.91.0
    • safetensors: 0.4.1
    • scikit-learn: 1.3.2
    • scipy: 1.11.4
    • sentry-sdk: 1.39.1
    • setproctitle: 1.3.3
    • setuptools: 68.2.2
    • shell-command-launcher: 1.1.0
    • six: 1.16.0
    • smart-open: 6.4.0
    • smmap: 5.0.1
    • sniffio: 1.3.0
    • snowballstemmer: 2.2.0
    • socksio: 1.0.0
    • soupsieve: 2.5
    • speed-benchmark: 1.1.1
    • sphinx: 7.2.6
    • sphinx-autoapi: 3.0.0
    • sphinx-autobuild: 2021.3.14
    • sphinx-book-theme: 1.1.0
    • sphinx-design: 0.5.0
    • sphinxcontrib-applehelp: 1.0.7
    • sphinxcontrib-devhelp: 1.0.5
    • sphinxcontrib-htmlhelp: 2.0.4
    • sphinxcontrib-jsmath: 1.0.1
    • sphinxcontrib-qthelp: 1.0.6
    • sphinxcontrib-serializinghtml: 1.1.9
    • sympy: 1.12
    • termcolor: 2.4.0
    • textaugment: 2.0.0
    • textblob: 0.17.1
    • thop: 0.1.1.post2209072238
    • threadpoolctl: 3.2.0
    • timm: 0.9.12
    • tokenizers: 0.13.3
    • tomlkit: 0.12.3
    • torch: 2.1.2
    • torchmetrics: 1.2.1
    • torchvision: 0.16.2
    • tornado: 6.4
    • tox: 4.11.4
    • tqdm: 4.66.1
    • transformers: 4.24.0
    • triton: 2.1.0
    • typeshed-client: 2.4.0
    • typing-extensions: 4.9.0
    • tzdata: 2023.4
    • urllib3: 2.1.0
    • virtualenv: 20.25.0
    • wandb: 0.16.2
    • wcwidth: 0.2.13
    • wget: 3.2
    • wheel: 0.41.2
    • yarl: 1.9.4
    • zipp: 3.17.0
  • System:

More info

By the way, since we already implement the link of last.ckpt, how about adding a new feature on best.ckpt? Although we support saving the topk checkpoints, there is no consistent way to reference the top1 checkpoint. For example, in the above figure, I have to check the files under the folder work_dirs/medclip_resnet50_iu_xray_R2GenGPT_30x/2024.01.16_17.44.40.927262/checkpoints to determine the file name epoch:5-val_loss:3.264.ckpt to reference the best checkpoint if I have not created the best.ckpt checkpoint. It will be more convenient if we implement the hard link of best.ckpt also. It can be implemented with the similar code as the last.ckpt part.

cc @carmocca @awaelchli

@shenmishajing shenmishajing added bug Something isn't working needs triage Waiting to be triaged by maintainers labels Jan 16, 2024
@awaelchli
Copy link
Contributor

@shenmishajing Thanks for the investigation. I am not sure that I understand the problem fully from your description. We could make the linkpath relative to filepath before linking, would that fix the issue? A code example would be nice if you have one.

Regarding hardlinks, I vote against it. We need to support checkpoints as directories (i.e. FSDP/DeepSpeed sharded checkpoints) and hardlinks won't work for directories. Also, we would like that users can share their checkpoints from one filesystem to another, and hardlinks only work on the filesystem they were created on. For these reasons, I suggest to stay with symlinks.

@awaelchli
Copy link
Contributor

FYI, we backed away from linking as the default in #19191 recently.

@awaelchli awaelchli added callback: model checkpoint and removed needs triage Waiting to be triaged by maintainers labels Jan 16, 2024
@shenmishajing
Copy link
Contributor Author

shenmishajing commented Jan 16, 2024

@shenmishajing Thanks for the investigation. I am not sure that I understand the problem fully from your description. We could make the linkpath relative to filepath before linking, would that fix the issue? A code example would be nice if you have one.

@awaelchli Yes, I think it will work. In the most case, it will work fine with just

os.symlink(os.path.basename(filepath), linkpath) 

since, we have just created the checkpoint after the call of this func. But, it will be more complex to support the filepath and linkpath from everywhere. I will try to implement this.

Regarding hardlinks, I vote against it. We need to support checkpoints as directories (i.e. FSDP/DeepSpeed sharded checkpoints) and hardlinks won't work for directories. Also, we would like that users can share their checkpoints from one filesystem to another, and hardlinks only work on the filesystem they were created on. For these reasons, I suggest to stay with symlinks.

Yeah, ok, I forget the sharded checkpoints, so the soft link will be better.

@awaelchli
Copy link
Contributor

And, ok, I will try to implement this.

That would be wonderful. Your help is very appreciated!

@shenmishajing
Copy link
Contributor Author

@awaelchli Ok, I find that the os.path.relpath func already support to calculate a relative path from a given path, so it will be simple. Just change the code

to

                os.symlink(os.path.relpath(filepath, os.path.dirname(linkpath)), linkpath)

will fix this issue. Should I create a PR for it?

And how about the idea of the best.ckpt? I can implement it also.

@awaelchli
Copy link
Contributor

Should I create a PR for it?

Yes that sounds good. It would also be good to have a test case to show how it fails.

And how about the idea of the best.ckpt? I can implement it also.

I suggest to open a separate feature issue about this request. I'd like to hear from others about this too. Currently, you can already achieve getting a best.ckpt by having a second model checkpoint callback.

@shenmishajing
Copy link
Contributor Author

@awaelchli I have created a PR #19303 for this. However, I found that the tmp_path in the unit tests return a abs path starting from \tmp leading to the false positive test results, since the code only breaks when the link path and file path are relative paths. But, how can I add tests for those cases? I'm not familiar with pytest and may need your help.

@shenmishajing
Copy link
Contributor Author

I suggest to open a separate feature issue about this request. I'd like to hear from others about this too. Currently, you can already achieve getting a best.ckpt by having a second model checkpoint callback.

@awaelchli I also create a new issue for it. Please see #19304.

@awaelchli
Copy link
Contributor

leading to the false positive test results, since the code only breaks when the link path and file path are relative paths. But, how can I add tests for those cases? I'm not familiar with pytest and may need your help.

Yeah I can help with that, I will take a look in the next few hours

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants