ClearML integration removes the best checkpoint after uploading to the server #9251
Comments
👋 Hello @kecsap, thank you for your interest in YOLOv5 🚀! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.
If this is a 🐛 Bug Report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we cannot help you.
If this is a custom training ❓ Question, please provide as much information as possible, including dataset images, training logs, screenshots, and a public link to online W&B logging if available.
For business inquiries or professional support requests please visit https://ultralytics.com.
Requirements
Python>=3.7.0 with all requirements.txt installed including PyTorch>=1.7. To get started:
git clone https://github.com/ultralytics/yolov5 # clone
cd yolov5
pip install -r requirements.txt # install
Environments
YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):
Status
If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training (train.py), validation (val.py), inference (detect.py) and export (export.py) on macOS, Windows, and Ubuntu every 24 hours and on every commit. |
@thepycoder is the logging code deleting local checkpoints on upload? |
Oh... so this might explain why I have not seen any best.pt at all after training various model sizes with YOLOv5 v6.2. I genuinely thought something had changed from 6.1 to 6.2 (we use the fixed releases for training because of some modifications to the training code). However, I also enabled ClearML when v6.2 was released, using their hosted service for now instead of a local server. None of my trained models had a best.pt in the weights folder after training. Looking at the ClearML artifacts of each run, I do see the checkpoints there now. |
@Denizzje no, the only reason best.pt would be missing is if you trained with --nosave or --noval. |
Just running the bleeding-edge master now, with ClearML installed:
$ python3 train.py --img 640 --batch 124 --epochs 3 --data XXX.yaml --weights yolov5s.pt --freeze 10
train: weights=yolov5s.pt, cfg=, data=XXX.yaml, hyp=data/hyps/hyp.scratch-low.yaml, epochs=3, batch_size=124, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, noplots=False, evolve=None, bucket=, cache=None, image_weights=False, device=, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=False, workers=8, project=runs/train, name=exp, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[10], save_period=-1, seed=0, local_rank=-1, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
...
3 epochs completed in 0.002 hours.
Optimizer stripped from runs/train/exp92/weights/last.pt, 14.8MB
Optimizer stripped from runs/train/exp92/weights/best.pt, 14.8MB
...
Validating runs/train/exp92/weights/best.pt...
...
Results saved to runs/train/exp92
2022-09-02 11:49:55,696 - clearml.Task - INFO - Waiting to finish uploads
2022-09-02 11:49:55,801 - clearml.Task - INFO - Completed model upload to http://X.X.X.X:8081/YOLOv5/training.XXX/models/best.pt
2022-09-02 11:50:01,036 - clearml.Task - INFO - Finished uploading
$ ls runs/train/exp92/weights/best.pt
ls: cannot access 'runs/train/exp92/weights/best.pt': No such file or directory
Repeating the same run without ClearML (note that no clearml.Task lines appear in the log this time):
$ python3 train.py --img 640 --batch 124 --epochs 3 --data XXX.yaml --weights yolov5s.pt --freeze 10
train: weights=yolov5s.pt, cfg=, data=XXX.yaml, hyp=data/hyps/hyp.scratch-low.yaml, epochs=3, batch_size=124, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, noplots=False, evolve=None, bucket=, cache=None, image_weights=False, device=, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=False, workers=8, project=runs/train, name=exp, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[10], save_period=-1, seed=0, local_rank=-1, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
...
3 epochs completed in 0.002 hours.
Optimizer stripped from runs/train/exp93/weights/last.pt, 14.8MB
Optimizer stripped from runs/train/exp93/weights/best.pt, 14.8MB
...
Validating runs/train/exp93/weights/best.pt...
...
Results saved to runs/train/exp93
$ ls runs/train/exp93/weights/best.pt
runs/train/exp93/weights/best.pt |
@kecsap thanks for the example! @thepycoder per the example in #9251 (comment) it appears the ClearML integration is moving or deleting best.pt after training completes. Can you try to reproduce and investigate a fix? Thanks! |
Hey @kecsap, thank you so much for the reproducible example. That is very weird indeed. I'm looking into it! For others who might have this issue: you can still retrieve your best model by going to the experiment web UI and downloading it there, or by using the SDK.
Not at a computer right now, but will fix ASAP! |
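For anyone who wants to try the SDK route mentioned above, here is a rough sketch of how it might look. It assumes the `clearml` package is installed and configured for your server, that you know the task ID of the training run, and that the uploaded checkpoint's model name contains "best" (the exact name is an assumption, so check it in the web UI). The helper names `pick_checkpoint`, `restore_checkpoint` and `restore_from_clearml` are made up for this example.

```python
import shutil
from pathlib import Path


def pick_checkpoint(output_models, wanted="best"):
    """Return the last output model whose name contains `wanted`, or None."""
    matches = [m for m in output_models if wanted.lower() in (m.name or "").lower()]
    return matches[-1] if matches else None


def restore_checkpoint(model, dest):
    """Download the model through get_local_copy() and copy it to `dest`."""
    cached = model.get_local_copy()  # path inside the local ClearML cache
    if not cached:
        raise RuntimeError("ClearML could not download the model file")
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(cached, dest)
    return dest


def restore_from_clearml(task_id, dest):
    """Fetch the 'best' output model of a ClearML task into `dest`."""
    from clearml import Task  # imported lazily so the helpers work without clearml

    task = Task.get_task(task_id=task_id)
    model = pick_checkpoint(task.models["output"])
    if model is None:
        raise FileNotFoundError("no output model with 'best' in its name")
    return restore_checkpoint(model, dest)


# Example call (task ID and run folder are placeholders):
# restore_from_clearml("<your-task-id>", "runs/train/expXX/weights/best.pt")
```

The copy step is there because `get_local_copy()` returns a path in the ClearML cache rather than in the run folder, so copying puts the file back where the rest of the YOLOv5 tooling expects it.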
Hey @glenn-jocher, @kecsap I opened a PR with a fix. Sorry for the inconvenience! |
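This is an automatic vacation reply from QQ Mail. Hello, I am currently on vacation and cannot reply to your email personally. I will reply as soon as possible after my vacation ends.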
|
@kecsap good news 😃! Your original issue may now be fixed ✅ in PR #9265. To receive this update:
Thank you for spotting this issue and informing us of the problem. Please let us know if this update resolves the issue for you, and feel free to inform us of any other issues you discover or feature requests that come to mind. Happy trainings with YOLOv5 🚀! |
Search before asking
YOLOv5 Component
Integrations
Bug
YOLOv5 Git is up-to-date.
After configuring a local ClearML server, everything works fine, but the integration removes the uploaded best checkpoint file (best.pt) from runs/train/expXX/weights/ after the training is finished and it is uploaded to the server.
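To check quickly whether a finished run is affected, something like the sketch below could be used. It only looks for the two checkpoint files that train.py reports in its log (best.pt and last.pt) under the run's weights folder. The function name `missing_checkpoints` is made up for this example.

```python
from pathlib import Path


def missing_checkpoints(run_dir, names=("best.pt", "last.pt")):
    """Return the checkpoint file names that are absent from <run_dir>/weights."""
    weights = Path(run_dir) / "weights"
    return [name for name in names if not (weights / name).is_file()]


# Example call (run folder is a placeholder):
# print(missing_checkpoints("runs/train/expXX"))
```

An empty list means both files are still on disk. A result containing best.pt after a run that logged "Optimizer stripped from .../best.pt" matches the behaviour described here.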
Environment
Ubuntu 20.04
Minimal Reproducible Example
No response
Additional
No response
Are you willing to submit a PR?