Zero-mAP bug: P, R and mAP are all 0.0 #9059
👋 Hello @athrunsunny, thank you for your interest in YOLOv5 🚀! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution. If this is a 🐛 Bug Report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we can not help you. If this is a custom training ❓ Question, please provide as much information as possible, including dataset images, training logs, screenshots, and a public link to online W&B logging if available. For business inquiries or professional support requests please visit https://ultralytics.com or email [email protected].

Requirements
Python>=3.7.0 with all requirements.txt installed including PyTorch>=1.7. To get started:

git clone https://github.com/ultralytics/yolov5  # clone
cd yolov5
pip install -r requirements.txt  # install

Environments
YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status
If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training (train.py), validation (val.py), inference (detect.py) and export (export.py) on macOS, Windows, and Ubuntu every 24 hours and on every commit.
It's a bug that already happened the other day; @glenn-jocher fixed it, but today it reappeared.
Have you updated your repo? @glenn-jocher created a PR for this yesterday.
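For reference, updating an existing local clone (assuming it tracks ultralytics/yolov5) is usually just a pull from inside the repo directory:

!git pull                         # fetch and merge the latest commits, including the fix PR
!pip install -r requirements.txt  # refresh dependencies in case they changed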
In my case, everything was working again yesterday, but today the bug returned. I clone the repository at every run in my Google Colab notebook.
@athrunsunny @giuseppebrb @pourmand1376 I'm still seeing this myself, but it's not consistently reproducible. I trained 10 models in a row in Colab just now and the zero-mAP bug randomly appeared in 2 out of 10. Importantly, all val losses are identical and all single-GPU trainings are now 100% reproducible, so the bug is not in training; it is apparently in the metrics. I need help debugging this, I don't know what's causing the problem.

# Train 10x YOLOv5s on COCO128 for 3 epochs
for _ in range(10):
    !python train.py --img 640 --batch 16 --epochs 3 --data coco128.yaml --weights yolov5s.pt --cache

@AyushExel FYI this is happening, don't have a solution yet.
@glenn-jocher I will try to solve the problem. However, I don't know of any commit that doesn't have this issue. Do you know of one?
@pourmand1376 yes, I've been manually bisecting commits, but since the bug is not 100% reproducible I would only test one training and revert to the most recent commit that passed that single training. This only appeared in the last few days. Some other observations:
This seems like a P0 problem. Let's prioritize this.
@pourmand1376 so it seems the git bisect approach is correct, but the test must have high statistics, i.e. run 30 trainings; with 30 good results we can be about 97% sure there's no bug in that commit.
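As a rough sanity check on that figure (a sketch only; the per-run failure rate on a bad commit is an assumption, loosely inferred from the 2-out-of-10 observation above):

# Confidence that a commit is good after n consecutive clean trainings,
# assuming an independent per-run failure probability p on a bad commit.
n = 30
for p in (0.1, 0.2):  # assumed failure rates
    print(f"p={p}: confidence ≈ {1 - (1 - p) ** n:.1%}")
# p=0.1 gives ~95.8% and p=0.2 gives ~99.9%, bracketing the ~97% quoted above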
@AyushExel no, the last hyp changes were months ago. I think this might be related to EMA, to running val.py directly from train.py, or maybe to AMP/FP16 ops between train and val. But the strangest part is that the val losses are always identical across all models, regardless of whether mAP is 0 or correct. Another observation is that calling val.py directly always works; the bug appears only in val during training. In other words, there is absolutely nothing wrong with training or with the models being trained; what's wrong is that val.py during training erroneously reports zero mAP.
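For reference, the kind of standalone validation call described above (the weights path here is illustrative) looks like this, and it reports correct mAP even when the in-training validation printed zeros:

!python val.py --weights runs/train/exp/weights/best.pt --data coco128.yaml --img 640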
Another thing I suspected was the introduction of identical filenames in /classify; that has never happened before in the repo. If that were the cause, though, the bug would appear directly in the v6.2 tag.
Hmm, I am using:

for _ in range(30):
    !python train.py --img 640 --batch 16 --epochs 2 --data coco128.yaml --weights yolov5x.pt --cache

So far:

!git bisect start
!git bisect bad
!git bisect good 20f1b7ea086fb317da93d2e603c4def2ebfb8187
!git bisect good fe809b8dad5236d86d5acbe047b5e0e6895b2b8a
!git bisect bad 4bc5520e9424cdb0bd73bffb091b85934d5096c8
!git bisect bad 27fb6fd8fc21c20290041f38046d7a60ae8c6e3a
!git bisect bad eb359c3a226f55c9b51efcfeae2e31c820e6e08a
!git bisect good 5c854fab5e43df82ebfd51197c2dc58e5212c5a6
!git bisect good d40cd0d454dcc34312cb5c40f45f64b76665c40c
!git bisect bad de6e6c0110adbb41f829c1288d5cdab7105892ae
!git bisect bad 61adf017f231f470afca2636f1f13e4cce13914b

Once we identify the first bad commit, it will be a lot easier to find the problem.
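For what it's worth, the bisection could in principle be automated with git bisect run and a small driver script; this is only a sketch, and test_bisect.py plus its log-parsing pattern are hypothetical, not something in the repo:

# test_bisect.py (hypothetical): exit non-zero if any of a few short trainings
# reports zero mAP, so `!git bisect run python test_bisect.py` can classify each
# commit automatically. Since the bug is intermittent, more repetitions give
# more confidence in a "good" verdict.
import re
import subprocess
import sys

CMD = ("python train.py --img 640 --batch 16 --epochs 3 "
       "--data coco128.yaml --weights yolov5s.pt --cache")

for _ in range(3):
    result = subprocess.run(CMD.split(), capture_output=True, text=True)
    tail = "\n".join(result.stdout.splitlines()[-10:])
    # Illustrative check: an all-zero P/R/mAP results row near the end of the log
    if re.search(r"\b0\s+0\s+0\s+0\s*$", tail, re.MULTILINE):
        sys.exit(1)  # mark this commit as bad
sys.exit(0)  # mark this commit as good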
@pourmand1376 oh, YOLOv5x will be slow. You can use the default Colab training command; the bug will appear there too:

for _ in range(10):
    !python train.py --img 640 --batch 16 --epochs 3 --data coco128.yaml --weights yolov5s.pt --cache
I trained the v6.2 tag 15x and all trainings were correct. So the bug is apparently not in the v6.2 release itself; it has been introduced in the 38 commits since then.
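If anyone needs an immediate workaround while this is being fixed, one option (commands assume a standard clone of ultralytics/yolov5) is to pin the local clone to that known-good tag:

!git fetch --tags
!git checkout v6.2  # the release verified clean in the 15x test above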
I am updating the git bisect commits. The good ones are not certain, since I am running 4 concurrent Colab sessions, but the bad ones are certain.
I'm still on commit d1dfcab with no issues with loss, P, R, mAP, etc., and I've been training 2 models (one locally and one on Colab) all night and morning.
This is what bisect reports ...

61adf01 is the first bad commit
@pourmand1376 ohhhhhhh, this makes sense, since torch.empty is not reproducible! Awesome @pourmand1376, this is a huge help. I'll revert this commit immediately; the speed gains from the torch.zeros to torch.empty switch are very small in any case.
@glenn-jocher @pourmand1376 quick question: do we know WHY this caused issues? From the docs I understand that it creates an uninitialized tensor with random values (which explains the non-reproducibility), but I don't get why the metrics would be stuck at 0?
@pourmand1376 thanks for the help!! I've just pushed #9068 with a fix by @0zppd.
@AyushExel At the very least, they should mention this in their docs.
@giuseppebrb @athrunsunny @pourmand1376 @robotwhispering @AyushExel good news 😃! Your original issue may now be fixed ✅ in PR #9068 by @0zppd. @pourmand1376 tracked down the problem using git bisect.

The original issue was that I had replaced torch.zeros() with torch.empty() in some ops like warmup and profiling to try to get slight speed improvements, and one op in particular ran a torch.empty() tensor through the model while it was in .train() mode, leading the batchnorm layers to add those values to their tracked statistics. Since torch.empty() is not initialized, it can take on extremely high or low values, causing some batchnorm layers to randomly output NaN. The PR has been extensively tested on 10x Colab trainings, and all 10 came back good now.

To receive this update:

Thank you for spotting this issue and informing us of the problem. Please let us know if this update resolves the issue for you, and feel free to inform us of any other issues you discover or feature requests that come to mind. Happy trainings with YOLOv5 🚀!
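For anyone wondering about the mechanism (per @AyushExel's question above), here is a minimal standalone sketch (not the actual YOLOv5 code path) of how an uninitialized tensor fed through a model in .train() mode can corrupt BatchNorm statistics:

import torch
import torch.nn as nn

# Toy model containing a BatchNorm layer that tracks running statistics.
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.BatchNorm2d(8))

# Warmup-style forward pass with an uninitialized tensor while in train() mode.
# torch.empty() returns whatever bytes happen to be in memory, so its values may
# be huge or non-finite, and BatchNorm folds them into running_mean/running_var.
model.train()
with torch.no_grad():
    model(torch.empty(1, 3, 32, 32))

print(model[1].running_mean)  # may already contain extreme or non-finite values

# A later validation pass in eval() mode normalizes with those corrupted
# statistics, so outputs (and any metrics computed from them) can become NaN.
model.eval()
with torch.no_grad():
    out = model(torch.zeros(1, 3, 32, 32))
print(torch.isnan(out).any())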
@glenn-jocher @pourmand1376 @0zppd Thank you for your great work.
@shubhambagwari try 300 epochs.
Thank you for your answer. I was having the same issue; it resolved after increasing the epochs beyond 3.
Thanks for this. I quit my training a number of times trying to fix getting 0 P, R and mAP values in the first two epochs, not realising that they would only start to increase by epoch 3 😆
Search before asking
Question
I trained my custom one-class dataset (about 40,000 pictures) using the default cfg of yolov5s with the latest version (6.2), and P, R and mAP are all 0 (still the same after about 100 epochs). However, it works normally with the previous version (6.1), and it is normal when I train with coco128. I think this may be a bug, but I haven't debugged it.
Additional
No response